Abstract

We consider bottom-$k$ sampling for a set $X$, picking a sample $S_k(X)$ consisting of the $k$ elements that are smallest according to a given hash function $h$. With this sample we can estimate the frequency $f = |Y|/|X|$ of any subset $Y$ as $|S_k(X) \cap Y|/k$. A standard application is the estimation of the Jaccard similarity $f = |A \cap B|/|A \cup B|$ between sets $A$ and $B$. Given the bottom-$k$ samples from $A$ and $B$, we construct the bottom-$k$ sample of their union as $S_k(A \cup B) = S_k(S_k(A) \cup S_k(B))$, and then the similarity is estimated as $|S_k(A \cup B) \cap S_k(A) \cap S_k(B)|/k$.

We show here that even if the hash function is only 2-independent, the expected relative error is $O(1/\sqrt{fk})$. For $fk = \Omega(1)$ this is within a constant factor of the expected relative error with truly random hashing.

For comparison, consider the classic approach of repeated min-wise hashing, where we use $k$ independent hash functions $h_1, \ldots, h_k$, storing the smallest element with each hash function. For min-wise hashing, there can be a constant bias with constant independence, and this is not reduced with more repetitions $k$. Recently Feigenblat et al. showed that bottom-$k$ circumvents the bias if the hash function is 8-independent and $k$ is sufficiently large. We get down to 2-independence for any $k$. Our result is based on a simple union bound, transferring generic concentration bounds for the hashing scheme to the bottom-$k$ sample, e.g., getting stronger probability error bounds with higher independence.

For weighted sets, we consider priority sampling which adapts efficiently to the concrete input 
weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, 
but again we show that generic concentration bounds can be applied. 



1 Introduction 

The concept of min-wise hashing (or the "MinHash algorithm"; see footnote 1) is a basic algorithmic tool suggested by Broder et al. for problems related to set similarity and containment. After the initial application of this algorithm in the early Altavista search engine to detecting and clustering similar documents, the scheme has reappeared in numerous other applications (see footnote 1) and is now a standard tool in data mining where it is used for estimating similarity [6, 8, 9], rarity [13], document duplicate detection [7, 20, 22, 37], etc.

* A short version of this paper will appear at STOC'13.

1 See http://en.wikipedia.org/wiki/MinHash
In an abstract mathematical view, we have two sets, $A$ and $B$, and we are interested in understanding their overlap in the sense of the Jaccard similarity $f = |A \cap B|/|A \cup B|$. In order to do this by sampling, we need sampling correlated between the two sets, so we sample by hashing. Consider a hash function $h : A \cup B \to [0, 1]$. For now we assume that $h$ is fully random, and has enough precision that collisions are avoided with probability 1. The main mathematical observation is that $\Pr[\arg\min h(A) = \arg\min h(B)]$ is precisely $f = |A \cap B|/|A \cup B|$. Thus, we may sample the element with the minimal hash from each set, and use them in $[\arg\min h(A) = \arg\min h(B)]$ for an unbiased estimate of $f$. Here, for a logical statement $S$, $[S] = 1$ if $S$ is true; otherwise $[S] = 0$.

For more concentrated estimators, we use repetition with $k$ independent hash functions, $h_1, \ldots, h_k$. For each set $A$, we store $M_k(A) = (\arg\min h_1(A), \ldots, \arg\min h_k(A))$, which is a sample with replacement from $A$. The Jaccard similarity between sets $A$ and $B$ is now estimated as $|M_k(A) \cap M_k(B)|/k$ where $|M_k(A) \cap M_k(B)|$ denotes the number of agreeing coordinates between $M_k(A)$ and $M_k(B)$. We shall refer to this approach as repeated min-wise or $k\times$min.

For our discussion, we consider the very related application where we wish to store a sample of a set $X$ that we can use to estimate the frequency $f = |Y|/|X|$ of any subset $Y \subseteq X$. The idea is that the subset $Y$ is not known when the sample from $X$ is made. The subset $Y$ is revealed later in the form of a characteristic function that can tell if (sampled) elements belong to $Y$. Using the $k\times$min sample $M_k(X)$, we estimate the frequency as $|M_k(X) \cap Y|/k$ where $|M_k(X) \cap Y|$ denotes the number of samples from $M_k(X)$ in $Y$.

Another classic approach for frequency estimation is to use just one hash function $h$ and use the $k$ elements from $X$ with the smallest hashes as a sample $S_k(X)$. This is a sample without replacement from $X$. As in [12], we refer to this as a bottom-$k$ sample. The method goes back at least to [19]. The frequency of $Y$ in $X$ is estimated as $|Y \cap S_k(X)|/k$. Even though surprisingly fast methods have been proposed to compute $k\times$min, the bottom-$k$ signature is much simpler and faster to compute. In a single pass through a set, we only apply a single hash function $h$ to each element, and use a max-priority queue to maintain the $k$ smallest elements with respect to $h$.
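As a minimal sketch of the single-pass computation just described, the following Python keeps the $k$ smallest hashes in a max-priority queue. The names `random_hash`, `bottom_k` and `estimate_frequency` are illustrative and do not come from the paper; `random_hash` simulates a fully random hash with a lookup table purely for demonstration, and any collision-free map into $[0, 1]$ could be passed in its place.

```python
import heapq
import random


def random_hash(seed):
    """Illustrative stand-in for h: a memoized, fully random map into [0, 1)."""
    rng = random.Random(seed)
    table = {}

    def h(x):
        if x not in table:
            table[x] = rng.random()
        return table[x]

    return h


def bottom_k(stream, h, k):
    """Return the set of the k elements of `stream` with the smallest hashes.

    heapq is a min-heap, so hashes are stored negated: heap[0] then holds the
    largest hash among the kept elements. Assumes k >= 1 and that distinct
    elements have distinct hashes.
    """
    heap = []     # entries (-h(x), x)
    kept = set()  # elements currently in the heap, so repeats are skipped
    for x in stream:
        if x in kept:
            continue
        hx = h(x)
        if len(heap) < k:
            heapq.heappush(heap, (-hx, x))
            kept.add(x)
        elif hx < -heap[0][0]:
            # x beats the largest kept hash: swap it in, drop the evicted one
            _, evicted = heapq.heapreplace(heap, (-hx, x))
            kept.discard(evicted)
            kept.add(x)
    return kept


def estimate_frequency(sample, in_subset, k):
    """Estimate f = |Y|/|X| as |Y ∩ S|/k, with Y given as a predicate."""
    return sum(1 for x in sample if in_subset(x)) / k
```

For instance, `estimate_frequency(bottom_k(X, h, k), lambda x: x in Y, k)` gives the frequency estimate for a subset `Y` that is only revealed after the sample was taken.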

It is standard (see footnote 1) to use bottom-$k$ samples to estimate the Jaccard similarity between sets $A$ and $B$, for this is exactly the frequency of the intersection in the union. First we construct the bottom-$k$ sample $S_k(A \cup B) = S_k(S_k(A) \cup S_k(B))$ of the union by picking the $k$ elements from $S_k(A) \cup S_k(B)$ with the smallest hashes. Next we return $|S_k(A) \cap S_k(B) \cap S_k(A \cup B)|/k$.
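The two steps of the similarity estimate can be sketched as follows. This is only an illustration under stated assumptions: `coordinated_hash` is a hypothetical helper that derives a repeatable value in $[0, 1)$ per element from Python's `random` module, standing in for the shared hash function $h$; it is not the hashing scheme analyzed in the paper.

```python
import random


def coordinated_hash(seed):
    """Illustrative hash into [0, 1): the same element always gets the same value."""
    def h(x):
        return random.Random(f"{seed}:{x}").random()
    return h


def bottom_k(elements, h, k):
    """The k elements with the smallest hashes (all of them if fewer than k)."""
    return set(sorted(set(elements), key=h)[:k])


def jaccard_estimate(sample_a, sample_b, h, k):
    """Estimate |A ∩ B|/|A ∪ B| from the bottom-k samples of A and B."""
    union_sample = bottom_k(sample_a | sample_b, h, k)  # S_k(S_k(A) ∪ S_k(B))
    return len(union_sample & sample_a & sample_b) / k
```

The samples of $A$ and $B$ can be computed independently, as long as both use the same `h`; that coordination is what makes an element of the intersection show up in both samples.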

Stepping back, for subset frequency, we generally assume that we can identify samples from the subset. In the application to set similarity, it is important that the samples are coordinated via hash functions, for this is what allows us to identify samples from the intersection as being sampled in both sets. In our mathematical analysis we will focus on the simpler case of subset frequency estimation, but it is the application to set similarity that motivates our special interest in sampling via hash functions.



Limited independence The two approaches $k\times$min and bottom-$k$ are similar in spirit, starting from the same base $1\times$min = bottom-1. With truly random hash functions, they have essentially the same relative standard deviation (standard deviation divided by expectation) bounded by $1/\sqrt{fk}$ where $f$ is the set similarity or subset frequency. The two approaches are, however, very different from the perspective of pseudo-random hash functions of limited independence [36]: a random hash function $h$ is $d$-independent if the hash values of any $d$ given elements are totally random.

With min-wise hashing, we have a problem with bias in the sense of sets in which some elements have a better than average chance of getting the smallest hash value. It is known that $1 + o(1)$ bias requires $\omega(1)$-independence [27]. This bias is not reduced by repetitions as in $k\times$min. However, recently Porat et al. [18] proved that the bias for bottom-$k$ vanishes for large enough $k \gg 1$ if we use 8-independent hashing. Essentially they get an expected relative error of $O(1/\sqrt{fk})$, and error includes bias. For $fk = \Omega(1)$, this is only a constant factor worse than with truly random hashing. Their results are cast in a new framework of "$d$-$k$-min-wise hashing", and the translation to our context is not immediate.

Results In this paper, we prove that bottom-$k$ sampling preserves the expected relative error of $O(1/\sqrt{fk})$ with 2-independent hashing, and this holds for any $k$ including $k = 1$. We note that when $fk = o(1)$, then $1/\sqrt{fk} = \omega(1)$, so our result does not contradict a possible large bias for $k = 1$.

We remark that we also get an $O(1/\sqrt{(1-f)k})$ bound on the expected relative error. This is important if we estimate the dissimilarity $1 - f$ of sets with large similarity $f = 1 - o(1)$.

For the more general case of weighted sets, we consider priority sampling [17] which adapts near-optimally to the concrete input weights [32], e.g., benefiting strongly from heavy-tailed input. We show that 2-independent hashing suffices for good confidence intervals.

Our positive finding with 2-independence contrasts recent negative results on the insufficiency of low independence, e.g., that linear probing needs the 5-independence [27] that was proved sufficient by Pagh et al. [25].

Implementation For 2-independent hashing we can use the fast multiplication-shift scheme from [14], e.g., if the elements are 32-bit keys, we pick two random 64-bit numbers a and b. The hash of key x is computed with the C-code (a * x + b) >> 32, where * is 64-bit multiplication which discards overflow, and >> is a right shift. This is 10-20 times faster than the fastest known 8-independent hashing based on a degree 7 polynomial tuned for a Mersenne prime field [35] (see footnote 2).
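The C expression can be mimicked in Python, whose integers do not overflow, by masking to 64 bits before shifting. This is a sketch for experimentation only; the function names are illustrative, and drawing a and b from Python's `random` module is an assumption made here for convenience.

```python
import random

MASK64 = (1 << 64) - 1  # keep the low 64 bits, as unsigned 64-bit C arithmetic would


def multiply_shift(a, b, x):
    """Python rendering of the C expression (a * x + b) >> 32 on 64-bit unsigned values."""
    return ((a * x + b) & MASK64) >> 32


def random_multiply_shift(seed=None):
    """Draw random 64-bit a and b and return the resulting hash on 32-bit keys."""
    rng = random.Random(seed)
    a, b = rng.getrandbits(64), rng.getrandbits(64)
    return lambda x: multiply_shift(a, b, x)
```

The result is a 32-bit integer; dividing it by `2**32` gives a value in $[0, 1)$ if a real-valued hash is wanted for the sampling schemes discussed here.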

Practical relevance We note that Mitzenmacher and Vadhan [23] have proved that 2-independence generally works if the input has enough entropy. However, the real world has lots of low entropy data. In [35] it was noted how consecutive numbers with zero entropy made linear probing with 2-independent hashing extremely unreliable. This was a problem in connection with denial-of-service attacks using consecutive IP addresses. For our set similarity, we would have similar issues in scenarios where small numbers are more common, hence where set intersections are likely to be fairly dense intervals of small numbers whereas the difference is more likely to consist of large random outliers. Figure 1 presents an experiment showing what happens if we try to estimate the dissimilarity with 2-independent hashing.

Stepping back, the result of Mitzenmacher and Vadhan is that 2-independence works for sufficiently random input. In particular, we do not expect problems to show up in random tests. However, this does not imply that 2-independent hashing can be trusted on real data unless we have special reasons to believe that the input has high entropy. In Figure 1, bottom-$k$ performs beautifully with 2-independent hashing, but no amount of experiments can demonstrate general reliability. However, the mathematical result of this paper is that bottom-$k$ can indeed be trusted with 2-independent hashing: the expected relative error is $O(1/\sqrt{fk})$ no matter the structure of the input.

Techniques To appreciate our analysis, let us first consider the trivial case where we are given a non-random threshold probability $p$ and sample all elements that hash below $p$. As in [16] we refer to this as threshold sampling. Since the hash of an element $x$ is uniform in $[0, 1]$, this samples $x$ with probability $p$. The sampling of $x$ depends only on the hash value of $x$, so if, say, the hash function is $d$-independent, then the number of samples is the sum of $d$-independent 0-1 variables. This scenario is very well understood.

2 See Table 2 in [35] for comparisons with different key lengths and computers between multiplication-shift (TwoIndep), and tuned polynomial hashing (CWtrick). The table considers polynomials of degree 3 and 4, but the cost is linear in the degree, so the cost for degree 7 is easily extrapolated.

Figure 1: Experiment with set consisting of 100300 32-bit keys. It has a "core" consisting of the consecutive numbers 1, ..., 100000. In addition it has 300 random "outliers". Using $k$ samples from the whole set, we want to estimate the frequency of the outliers. The true frequency is $300/100300 \approx 0.003$. We used $k = 1, \ldots, 100000$ in $k\times$min and bottom-$k$ and made one hundred experiments. For each $k$, we sorted the estimates, plotting the 10th and 90th value, labeled as 10% and 90% fractile in the figures. We also plotted the results from a single experiment. For readability, only one in every 100 values of $k$ is plotted. Both schemes converge, but due to bias, $k\times$min converges to a value that is 70% too large. Since bottom-$k$ does sampling without replacement, it becomes exact when the number of samples is the size of the whole set. The bias is a function of the structure of the subset within the whole set, e.g., the core set must have a negative bias complementing the positive bias of the outliers. It is therefore not possible to correct for the bias if one only has the sample available.

We could set $p = k/n$, and get an expected number of $k$ samples. Morally, this should be similar to a bottom-$k$ sample, which is what we get if we end up with exactly $k$ samples, that is, if we end up with $h_{(k)} < p \le h_{(k+1)}$ where $h_{(i)}$ denotes the $i$th smallest hash value. What complicates the situation is that $h_{(k)}$ and $h_{(k+1)}$ are random variables depending on all the random hash values.

An issue with threshold sampling is that the number of samples is variable. This is an issue if we have bounded capacity to store the samples. With $k$ expected samples, we could put some limit $K \gg k$ on the number of samples, but any such limit introduces dependencies that have to be understood. Also, if we have room for $K$ samples, then it would seem wasteful not to fill it with a full bottom-$K$ sample.

Our analysis of bottom-$k$ samples is much simpler than the one in [18] for 8-independent hashing with $k \gg 1$. With a union bound we reduce the analysis of bottom-$k$ samples to the trivial case of threshold sampling. Essentially we only get a constant loss in the error probabilities. With 2-independent hashing, we then apply Chebyshev's inequality to show that the expected relative error is $O(1/\sqrt{fk})$. The error probability bounds are immediately improved if we use hash functions with higher independence.

It is already known from [5] that we can use a 2-independent bottom-$k$ sample of a set to estimate its size $n$ with an expected error of $O(\sqrt{n})$. The estimate is simply the inverse of the $k$th smallest sample. Applying this to two $\Theta(n)$-sized sets and their union, we can estimate $|A|$, $|B|$, $|A \cup B|$ and $|A \cap B| = |A| + |B| - |A \cup B|$ each with an expected error of $O(\sqrt{n})$. However, $|A \cap B|$ may be much smaller than $O(\sqrt{n})$. If we instead multiply our estimate of the similarity $f = |A \cap B|/|A \cup B|$ with the estimate of $|A \cup B|$, the resulting estimate of $|A \cap B|$ is

$$(1 \pm O(1/\sqrt{fk}))\, f\, (|A \cup B| \pm O(\sqrt{n})) = |A \cap B| \pm O(\sqrt{|A \cap B|}).$$

The analysis of priority sampling for weighted sets is much more delicate, but again, using union bounds, 
we show that generic concentration bounds apply. 

2 Bottom-k samples 

We are given a set $X$ of $n$ elements. A hash function maps the elements uniformly and collision free into $[0, 1]$. Our bottom-$k$ sample $S$ consists of the $k$ elements with the lowest hash values. The sample is used to estimate the frequency $f = |Y|/|X|$ of any subset $Y$ of $X$ as $|Y \cap S|/k$. With 2-independent hashing, we will prove the following error probability bound for any $r \le \bar{r} = \sqrt{k}/3$:

$$\Pr\Bigl[\,\bigl||Y \cap S| - fk\bigr| > 3r\sqrt{fk}\,\Bigr] \le 4/r^2. \qquad (1)$$

The result is obtained via a simple union bound where stronger hash functions yield better error probabilities. With $d$-independence with $d$ an even constant, the probability bound is $O(1/r^d)$.

It is instructive to compare $d$-independence with the idea of storing $d$ independent bottom-$k$ samples, each based on 2-independence, and use the median estimate. Generally, if the probability of a certain deviation is $p$, the deviation probability for the median is bounded by $(2ep)^{d/2}$, so the $4/r^2$ from (1) becomes $(2e \cdot 4/r^2)^{d/2} < (5/r)^d$, which is the same type of probability that we get with a single $d$-independent hash function. The big advantage of a single $d$-independent hash function is that we only have to store a single bottom-$k$ sample.

If we are willing to use much more space for the hash function, then we can use twisted tabulation hashing [28] which is very fast, and then we get exponential decay in $r$ though only down to an arbitrary polynomial in the space used.
In order to show that the expected relative error is $O(1/\sqrt{fk})$, we also prove the following bound for $fk \le 1/4$:

$$\Pr[|Y \cap S| \ge \ell] = O(fk/\ell^2 + \sqrt{f}/\ell). \qquad (2)$$

From (1) and (2) we get

Proposition 1 For bottom-$k$ samples based on 2-independent hashing, a fraction $f$ subset is estimated with an expected relative error of $O(1/\sqrt{fk})$.

Proof The proof assumes (1) and (2). For the case $fk > 1/4$, we will apply (1). The statement is equivalent to saying that the sample error $\bigl||Y \cap S| - fk\bigr|$ in expectation is bounded by $O(\sqrt{fk})$. This follows immediately from (1) for errors below $3\bar{r}\sqrt{fk} = k\sqrt{|Y|/n}$. However, by (1), the probability of a larger error is bounded by $4/\bar{r}^2 = O(1/k)$. The maximal error is $k$, so the contribution to the expected error is $O(1)$. This is $O(\sqrt{fk})$ since $fk > 1/4$.

We will now handle the case $fk \le 1/4$ using (2). We want to show that the expected absolute error is $O(\sqrt{fk})$. We note that only positive errors can be bigger than $fk$, so if the expected error is above $2fk$, the expected number of samples from $Y$ is proportional to the expected error. We have $\sqrt{fk} \ge 2fk$, so for the expected error bound, it suffices to prove that the expected number of samples is $E[|Y \cap S|] = O(\sqrt{fk})$. Using (2) for the probabilities, we now sum the contributions over exponentially increasing sample sizes.

$$E[|Y \cap S|] \le \sum_{i=0}^{\lfloor \lg k \rfloor} 2^{i+1} \Pr[|Y \cap S| \ge 2^i] = \sum_{i=0}^{\lfloor \lg k \rfloor} O\bigl(2^{i+1}(fk/2^{2i} + \sqrt{f}/2^i)\bigr) = O(fk + \sqrt{f}(1 + \lg k)) = O(\sqrt{fk}).$$



2.1 A union upper bound 

First we consider overestimates. For positive parameters $a$ and $b$ to be chosen, we will bound the probability of the overestimate

$$|Y \cap S| > \frac{1+b}{1-a}\, fk. \qquad (3)$$

Define the threshold probability

$$p = \frac{k}{n(1-a)}.$$

Note that $p$ is defined deterministically, independent of any samples. It is easy to see that the overestimate (3) implies one of the following two threshold sampling events:

(A) The number of elements from $X$ that hash below $p$ is less than $k$. We expect $pn = k/(1-a)$ elements, so $k$ is a factor $(1-a)$ below the expectation.

(B) $Y$ gets more than $(1+b)p|Y|$ hashes below $p$, that is, a factor $(1+b)$ above the expectation.
To see this, assume that both (A) and (B) are false. When (A) is false, we have at least $k$ hashes from $X$ below $p$, so the largest hash in $S$ is below $p$. Now if (B) is also false, we have at most $(1+b)p|Y| = (1+b)/(1-a) \cdot fk$ elements from $Y$ hashing below $p$, and only these elements from $Y$ could be in $S$. This contradicts (3). By the union bound, we have proved

Proposition 2 The probability of the overestimate (3) is bounded by $P_A + P_B$ where $P_A$ and $P_B$ are the probabilities of the events (A) and (B), respectively.
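The implication behind this union bound is deterministic: for any outcome of the hash values, an overestimate forces event (A) or event (B). The sketch below checks this on simulated hash values. It is an illustration, not part of the proof; the function name, the choice $X = \{0, \ldots, n-1\}$, $Y = \{0, \ldots, m-1\}$, and the use of fully random hash values are all assumptions made here.

```python
import random


def check_proposition_2(n, m, k, a, b, trials, seed=0):
    """Count overestimates and how many of them occur without (A) or (B).

    X = {0, ..., n-1} and Y = {0, ..., m-1}. Hash values are drawn fully at
    random for simplicity; the implication itself does not depend on how the
    hash values were generated.
    """
    rng = random.Random(seed)
    p = k / (n * (1 - a))    # threshold probability
    bound = (1 + b) * p * m  # (1+b) p |Y| = (1+b)/(1-a) * f k
    overestimates = violations = 0
    for _ in range(trials):
        hashes = [rng.random() for _ in range(n)]
        sample = sorted(range(n), key=hashes.__getitem__)[:k]  # bottom-k sample
        y_in_sample = sum(1 for i in sample if i < m)
        x_below = sum(1 for v in hashes if v < p)
        y_below = sum(1 for v in hashes[:m] if v < p)
        event_a = x_below < k      # fewer than k elements of X below p
        event_b = y_below > bound  # too many elements of Y below p
        if y_in_sample > bound:
            overestimates += 1
            if not (event_a or event_b):
                violations += 1
    return overestimates, violations
```

The second returned value should always be zero; the first shows how often the overestimate happens at all for the chosen parameters.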



Upper bound with 2-independence Addressing events like (A) and (B), let $m$ be the number of elements in the set $Z$ considered, e.g., $Z = X$ or $Z = Y$. We study the number of elements hashing below a given threshold $p \in [0, 1]$. Assuming that the hash values are uniform in $[0, 1]$, the mean is $\mu = mp$. Assuming 2-independence of the hash values, the variance is $mp(1-p) = (1-p)\mu$ and the standard deviation is $\sigma = \sqrt{(1-p)\mu}$. By Chebyshev's inequality, we know that the probability of a deviation by $r\sigma$ is bounded by $1/r^2$. Below we will only use that the relative standard deviation $\sigma/\mu$ is bounded by $1/\sqrt{\mu}$.

For any given $r \le \sqrt{k}/3$, we will fix $a$ and $b$ to give a combined error probability of $2/r^2$. More precisely, we will fix $a = r/\sqrt{k}$ and $b = r/\sqrt{fk}$. This also fixes $p = k/(n(1-a))$. We note for later that $a \le 1/3$ and $a \le b$. This implies

$$(1+b)/(1-a) \le 1 + 3b = 1 + 3r/\sqrt{fk}. \qquad (4)$$

In connection with (A) we study the number of elements from $X$ hashing below $p$. The mean is $pn \ge k$ so the relative standard deviation is less than $1/\sqrt{k}$. It follows that a relative error of $a = r/\sqrt{k}$ corresponds to at least $r$ standard deviations, so

$$P_A = \Pr[\#\{x \in X \mid h(x) < p\} < (1-a)np] \le 1/r^2.$$

In connection with (B) we study the number of elements from $Y$ hashing below $p$. Let $m = |Y|$. The mean is $pm = km/(n(1-a))$ and the relative standard deviation less than $1/\sqrt{pm} < 1/\sqrt{km/n}$. It follows that a relative error of $b = r/\sqrt{km/n}$ is more than $r$ standard deviations, so

$$P_B = \Pr[\#\{y \in Y \mid h(y) < p\} > (1+b)mp] \le 1/r^2.$$

By Proposition 2 we conclude that the probability of (3) is bounded by $2/r^2$. Rewriting (3) with (4), we conclude that

$$\Pr[|Y \cap S| > fk + 3r\sqrt{fk}] \le 2/r^2. \qquad (5)$$

This bounds the probability of the positive error in (1). The above constants 3 and 2 are moderate, and they can easily be improved if we look at asymptotics. Suppose we want good estimates for subsets $Y$ of
frequency at least f m m, that is, \Y\ > / m in|^|- This time, we set a = r / 1 \fjk, and then we get Pa < / /r 2 . 
We also set b = r/y/Jk preserving Pb < 1/r 2 . Now for any Y C X with \ Y\ > fn, we have 

$$\Pr[|Y \cap S| > (1+\varepsilon)fk] \le (1+f)/r^2 \qquad (6)$$

where $\varepsilon = \frac{1 + r/\sqrt{fk}}{1 - r/\sqrt{k}} - 1 = \frac{r/\sqrt{k} + r/\sqrt{fk}}{1 - r/\sqrt{k}}$.

With $f = o(1)$ and $k = \omega(1)$, the error is $\varepsilon = (1+o(1))\,r/\sqrt{fk}$, and the error probability is $P_\varepsilon = (1+f)/r^2 = (1+o(1))/r^2$. Conversely, this means that if we for subsets of frequency $f$ and a relative positive error $\varepsilon$ want an error probability around $P_\varepsilon$, then we set $r = \sqrt{1/P_\varepsilon}$ and $k = r^2/(f\varepsilon^2) = 1/(f P_\varepsilon \varepsilon^2)$.
2.2 A union lower bound

We have symmetric bounds for underestimates:

$$|Y \cap S| < \frac{1-b'}{1+a'}\, fk. \qquad (7)$$

This time we define the threshold probability $p' = \frac{k}{n(1+a')}$. It is easy to see that the underestimate (7) implies one of the following two events:

(A') The number of elements from $X$ below $p'$ is at least $k$. We expect $p'n = k/(1+a')$ elements, so $k$ is a factor $(1+a')$ above the expectation.

(B') $Y$ gets less than $(1-b')p'|Y|$ hashes below $p'$, that is, a factor $(1-b')$ below the expectation.

To see this, assume that both (A') and (B') are false. When (A') is false, we have less than $k$ hashes from $X$ below $p'$, so $S$ must contain all hashes below $p'$. Now if (B') is also false, we have at least $(1-b')p'|Y| = (1-b')/(1+a') \cdot fk$ elements from $Y \subseteq X$ hashing below $p'$, all of which must be in $S$. This contradicts (7). By the union bound, we have proved

Proposition 3 The probability of the underestimate (7) is bounded by $P_{A'} + P_{B'}$ where $P_{A'}$ and $P_{B'}$ are the probabilities of the events (A') and (B'), respectively.

Lower bound with 2-independence Using Proposition 3 we will bound the probability of underestimates, complementing our previous probability bounds for overestimates from Section 2.1. We will provide bounds for the same overall relative error as we did for the overestimates; namely

$$\varepsilon = \frac{1+b}{1-a} - 1 = (a+b)/(1-a).$$

However, for the events (A') and (B') we are going to scale up the relative errors by a factor $(1+a)$, that is, we will use $a' = a(1+a)$ and $b' = b(1+a)$. The overall relative negative error from (7) is then

$$\varepsilon' = 1 - \frac{1-b'}{1+a'} = (a'+b')/(1+a') = (1+a)(a+b)/(1+a') < (a+b) < \varepsilon.$$

Even with this smaller error, we will get better probability bounds than those we obtained for the overestimates. For (A) we used $1/\sqrt{k}$ as an upper bound on the relative standard deviation, so a relative error of $a$ was counted as $s_A = a\sqrt{k}$ standard deviations. In (A') we have mean $\mu' = np' = k/(1+a')$, so the relative standard deviation is bounded by $1/\sqrt{k/(1+a')} = \sqrt{1+a+a^2}/\sqrt{k}$. This means that for (A'), we can count a relative error of $a' = a(1+a)$ as

$$s'_A = a(1+a)\sqrt{k}/\sqrt{1+a+a^2} = s_A(1+a)/\sqrt{1+a+a^2} > s_A$$

standard deviations. In Section 2.1 we bounded $P_A$ by $1/s_A^2$, and now we can bound $P_{A'}$ by $1/{s'_A}^2 < 1/s_A^2$. The scaling has the same positive effect on our probability bounds for (B'). That is, in Section 2.1 a relative error of $b$ was counted as $s_B = b\sqrt{fk}$ standard deviations. With (B') our relative error of $b' = b(1+a)$ is counted as

$$s'_B = b(1+a)\sqrt{fk}/\sqrt{1+a+a^2} = s_B(1+a)/\sqrt{1+a+a^2} > s_B$$

standard deviations, and then we can bound $P_{B'}$ by $1/{s'_B}^2 < 1/s_B^2$. Summing up, our negative relative error $\varepsilon'$ is smaller than our previous positive error $\varepsilon$, and our overall negative error probability bound $1/{s'_A}^2 + 1/{s'_B}^2$ is smaller than our previous positive error probability bound $1/s_A^2 + 1/s_B^2$. We therefore translate (5) to

$$\Pr[|Y \cap S| < fk - 3r\sqrt{fk}] \le 2/r^2, \qquad (8)$$

which together with (5) establishes (1). Likewise (6) translates to

$$\Pr\bigl[\,\bigl||Y \cap S| - fk\bigr| > \varepsilon fk\,\bigr] \le 2(1+f)/r^2 \quad \text{where } \varepsilon = \frac{1 + r/\sqrt{fk}}{1 - r/\sqrt{k}} - 1. \qquad (9)$$



As for the positive error bounds we note that with $f = o(1)$ and $k = \omega(1)$, the error is $\varepsilon = (1+o(1))\,r/\sqrt{fk}$ and the error probability is $P_\varepsilon = (2+o(1))/r^2$. Conversely, this means that if we for a target relative error $\varepsilon$ want an error probability around $P_\varepsilon$, then we set $r = \sqrt{2/P_\varepsilon}$ and $k = r^2/(f\varepsilon^2) = 2/(f P_\varepsilon \varepsilon^2)$.
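To make the parameter choice concrete, the leading-order formulas $r = \sqrt{2/P_\varepsilon}$ and $k = 2/(f P_\varepsilon \varepsilon^2)$ can be evaluated directly. The helper below is a hypothetical convenience, not something defined in the paper, and it ignores the $o(1)$ terms, so it is only indicative in the regime $f = o(1)$, $k = \omega(1)$.

```python
import math


def bottom_k_sample_size(f, eps, p_err):
    """Leading-order r and k for relative error eps on a frequency-f subset
    with two-sided error probability around p_err (lower-order terms ignored)."""
    r = math.sqrt(2.0 / p_err)
    k = 2.0 / (f * p_err * eps ** 2)
    return r, k
```

For example, with $f = 0.01$, $\varepsilon = 0.1$ and $P_\varepsilon = 0.05$ this gives $r^2 = 40$ and $k = 400000$, which illustrates how quickly the required sample size grows for rare subsets under a Chebyshev-type bound.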



2.3 Rare subsets 

We now consider the case where the expected number $fk$ of samples from $Y$ is less than $1/4$. We wish to prove (2):

$$\Pr[|Y \cap S| \ge \ell] = O(fk/\ell^2 + \sqrt{f}/\ell).$$

For some balancing parameter $c \ge 2$, we use the threshold probability $p = ck/n$. The error event (A) is that less than $k$ elements from $X$ sample below $p$. The error event (B) is that at least $\ell$ elements from $Y$ hash below $p$. As in Proposition 2 we observe that $\ell$ bottom-$k$ samples from $Y$ implies (A) or (B), hence that $\Pr[|Y \cap S| \ge \ell] \le P_A + P_B$.

The expected number of elements from $X$ that hash below $p$ is $ck$. The error event (A) is that we get less than $k$, which is less than half the expectation. This amounts to at least $\sqrt{ck}/2$ standard deviations, so by Chebyshev's inequality, the probability of (A) is $P_A \le 1/(\sqrt{ck}/2)^2 = 4/(ck)$.

The event (B) is that at least $\ell$ elements from $Y$ hash below $p$, while the expectation is only $fck$. Assuming that $\ell \ge 2fck$, the error is by at least $(\ell/2)/\sqrt{fck}$ standard deviations. By Chebyshev's inequality, the probability of (B) is $P_B \le 1/((\ell/2)/\sqrt{fck})^2 = 4fck/\ell^2$. Thus

$$P_A + P_B \le 4/(ck) + 4fck/\ell^2.$$

We wish to pick $c$ for balance, that is,

$$4/(ck) = 4fck/\ell^2 \iff c = \ell/(\sqrt{f}\,k).$$

However, we have assumed that $c \ge 2$ and that $\ell \ge 2fck$. The latter is satisfied because $2fck = 2fk\ell/(\sqrt{f}\,k) = 2\sqrt{f}\,\ell$ and $f \le 1/4$. Assuming that $c = \ell/(\sqrt{f}\,k) \ge 2$, we get

$$P_A + P_B \le 8/(k(\ell/(\sqrt{f}\,k))) = 8\sqrt{f}/\ell.$$

When $\ell/(\sqrt{f}\,k) < 2$, we set $c = 2$. Then

$$P_A + P_B \le 2/k + 8fk/\ell^2 \le 16fk/\ell^2.$$

Again we need to verify that $\ell \ge 2fck = 4fk$, but that follows because $\ell \ge 1$ and $fk \le 1/4$. We know that at least one of the above two cases applies, so we conclude that

$$\Pr[|Y \cap S| \ge \ell] \le P_A + P_B = O(fk/\ell^2 + \sqrt{f}/\ell),$$

completing the proof of (2).

3 Priority sampling 

We now consider the more general situation where we are dealing with a set $I$ of weighted items with $w_i$ denoting the weight of item $i \in I$. Let $\sum I = \sum_{i \in I} w_i$ denote the total weight of set $I$.

Now that we are dealing with weighted items, we will use priority sampling [17] which generalizes the bottom-$k$ samples we used for unweighted elements. The input is a set $I$ of weighted items. Each item or element $i$ is identified by a unique key which is hashed uniformly to a random number $h_i \in (0, 1)$. The item is assigned a priority $q_i = w_i/h_i > w_i$. We assume that all priorities end up distinct and different from the weights. If not, we could break ties based on an ordering of the items. The priority sample $S$ of size $k$ contains the $k$ samples of highest priority, but it also stores a threshold $\tau$ which is the $(k+1)$th highest priority. Based on this we assign a weight estimate $\hat{w}_i$ to each item $i$. If $i$ is not sampled, $\hat{w}_i = 0$; otherwise $\hat{w}_i = \max\{w_i, \tau\}$. A basic result from [17] is that $E[\hat{w}_i] = w_i$ if the hash function is truly random (in [17], the $h_i$ were described as random numbers, but here they are hashes of the keys).
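The definition above translates into a short routine. The following is a minimal sketch under stated assumptions: the names `key_hash`, `priority_sample` and `subset_sum_estimate` are illustrative, `key_hash` is a hypothetical stand-in for the hash into $(0, 1)$, and returning all items exactly with threshold 0 when there are at most $k$ items is a convention chosen here, not something stated in the text.

```python
import heapq
import random


def key_hash(seed):
    """Illustrative hash of a key into (0, 1], repeatable for a given seed."""
    def h(key):
        return 1.0 - random.Random(f"{seed}:{key}").random()
    return h


def priority_sample(weights, h, k):
    """Return (estimates, tau) for a priority sample of size k.

    weights maps each key to a positive weight. A min-heap keeps the k+1
    highest priorities q_i = w_i / h(i); its minimum is the threshold tau.
    """
    heap = []  # entries (priority, key)
    for key, w in weights.items():
        q = w / h(key)
        if len(heap) <= k:
            heapq.heappush(heap, (q, key))
        elif q > heap[0][0]:
            heapq.heapreplace(heap, (q, key))
    if len(heap) <= k:
        # Assumed convention: with at most k items, all are kept exactly.
        return {key: weights[key] for _, key in heap}, 0.0
    tau, _ = heapq.heappop(heap)  # the (k+1)th highest priority
    return {key: max(weights[key], tau) for _, key in heap}, tau


def subset_sum_estimate(estimates, in_subset):
    """Sum the weight estimates of the sampled items that lie in the subset."""
    return sum(w for key, w in estimates.items() if in_subset(key))
```

Unsampled items implicitly get estimate 0, so a subset sum only needs the sampled items, and the subset predicate can be supplied long after the sample was taken.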

We note that priority sampling generalizes the bottom-$k$ sample we used for unweighted items, for if all weights are unit, then the $k$ highest priorities correspond to the $k$ smallest hash values. In fact, priority sampling predates [12], and [12] describes bottom-$k$ samples for weighted items as a generalization of priority sampling, picking the first $k$ items according to an arbitrary randomized function of the weights.

The original objective of priority sampling [17] was subset sum estimation. A subset $J \subseteq I$ of the items is selected, and we estimate the total weight in the subset as $\hat{w}_J = \sum\{\hat{w}_i \mid i \in J \cap S\}$. By linearity of expectation, this is an unbiased estimator. A cool application from [17] was that as soon as the signature of the Slammer worm [24] was identified, we could inspect the priority samples from the past to track its history and identify infected hosts. An important point is that the Slammer worm was not known when the samples were made. Samples are made with no knowledge of which subsets will later turn out to be of interest.

Trivially, if we want to estimate the relative subset weight $\sum J/\sum I$ and we do not know the exact total, we can divide $\hat{w}_J$ with the estimate $\hat{w}_I$ of the total. As with the bottom-$k$ sampling for unweighted items, we can easily use priority sampling to estimate the similarity of sets of weighted items: given the priority sample from two sets, we can easily construct the priority sample of their union, and estimate the intersection as a subset. This is where it is important that we use a hash function so that the sampling from different sets is coordinated, e.g., we could not use iterative sampling procedures like the one in [17]. In the case of histogram similarity, it is natural to allow the same item to have different weights in different sets. More specifically, allowing zero weights, every possible item has a weight in each set. For the similarity we take the sum of the minimum weight for each item, and divide it by the sum of the maximum weight for each item. This requires a special sampling that we shall return to at the end.

Priority sampling is not only extremely easy to implement on-line with a standard min-priority queue; it also has some powerful universal properties in its adaptation to the concrete input weights. As proved in [32], given one extra sample, priority sampling has smaller variance sum $\sum_i \mathrm{Var}[\hat{w}_i]$ than any off-line sampling scheme tailored for the concrete input weights. In particular, priority sampling benefits strongly if there are dominant weights $w_i$ in the input, estimated precisely as $\hat{w}_i = \max\{w_i, \tau\} = w_i$. In the important case of heavy tailed input distributions, we thus expect most of the total weight to be estimated without any error. The quality of a priority sample is therefore often much better than what can be described in terms of simple parameters such as total weight, number of items, etc. The experiments in [17] on real and synthetic data show how priority sampling sometimes gains orders of magnitude in estimate quality over competing methods.

The quality of a priority estimate depends completely on the distribution of weights in the input, and 
often we would like to know how much we can trust a given estimate. What we really want from a sample is 
not just an estimate, but a confidence interval for the subset sum [33]: from the sample we want to compute
lower and upper bounds that capture the true value with some desired probability. The confidence interval 
does not have to be a simple nice function. It has to be efficiently computable, and we want it to be as tight 
as possible. 

All the current analysis of priority sampling H17ll32j|33l is heavily based on true randomness, assuming 
that the priorities are independent random variables, e.g. , the proof from [ 17 1 that E[wi] = Wi starts by fixing 
the priorities qj of all the other items j ^ i. However, in this paper, we want to use hash functions with 
independence as low as 2, and then any such approach breaks down. Estimates will no longer be unbiased, 
but we will show that good confidence intervals can still be computed. 

To the best of our knowledge, our paper is the first to show that anything useful can be said about subset 
sum estimates based on a given number $k$ of samples made using < 8-independent random variables. The
essence of our result is that priority sampling gets reasonable performance with any hashing scheme as long 
as it is 2-independent. 

Below we first develop error probability bounds for priority sampling with limited independence applied 
to a given set of input weights. Afterwards we show how confidence intervals can be derived from a sample. 
At the very end, we show how to estimate histogram similarity. 

3.1 Threshold sampling 

Generalizing the pattern for unweighted sets, our basic goal is to relate the error probabilities with priority 
sampling to the much simpler case of threshold sampling for weighted items [16]. In our description of 
threshold sampling, we will develop notation and basic results that will later be used for priority sampling. 

In threshold sampling, we do not have a predefined sample size. Instead we have a given threshold $t$. We
will still use exactly the same random priorities as above, but now an item is sampled if and only if $q_i > t$.
The weight estimate is

$$\hat w_i^t = \begin{cases} \max\{w_i, t\} & \text{if } q_i > t, \\ 0 & \text{otherwise.} \end{cases}$$

Thus, if priority sampling leads to threshold $t$, then the priority estimates are $\hat w_i = \hat w_i^t$. We shall use $I^t$
to denote the items sampled with threshold $t$, that is, items $i$ with $q_i > t$. The priority threshold $\tau$ is the
$(k+1)$st largest priority, hence the smallest value such that $|I^\tau| \le k$.
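
The two sampling schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the hash values are passed in as a list of numbers in $(0, 1)$ rather than produced by a 2-independent hash function, and priorities are assumed to be distinct.

```python
def threshold_estimates(weights, hashes, t):
    """Threshold sampling: item i is sampled iff q_i = w_i / h_i > t,
    and a sampled item gets the weight estimate max(w_i, t)."""
    estimates = {}
    for i, (w, h) in enumerate(zip(weights, hashes)):
        if w / h > t:
            estimates[i] = max(w, t)
    return estimates  # unsampled items implicitly have estimate 0


def priority_threshold(weights, hashes, k):
    """The (k+1)st largest priority (needs more than k items)."""
    priorities = sorted((w / h for w, h in zip(weights, hashes)), reverse=True)
    return priorities[k]


def priority_estimates(weights, hashes, k):
    """Priority sampling is threshold sampling with threshold tau."""
    tau = priority_threshold(weights, hashes, k)
    return tau, threshold_estimates(weights, hashes, tau)
```

With distinct priorities, `priority_estimates` returns exactly `k` sampled items, since exactly `k` priorities are strictly above the $(k+1)$st largest one.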

We note that threshold sampling is well-known from statistics (see, e.g., [29]). With fully random and
independent $h_i$, each item $i$ is sampled independently: a so-called Poisson sampling scheme. With the
given weights $w_i$ and threshold $t$, let $k_t$ be the expected number of samples with threshold $t$. Among all
possible Poisson sampling schemes with unbiased estimators $\hat w_i$ and an expected number of $k_t$ samples,
threshold sampling is the unique scheme that minimizes the variance sum $\sum_{i \in I} \mathrm{Var}[\hat w_i]$ [29, p. 86].




(10)
Fractional subsets and inner products It is convenient to generalize from regular subsets to fractional
subsets. For each $i \in I$, there is a fraction $f_i \in [0, 1]$. We want to estimate $fw$ denoting the inner product
$\sum_{i \in I} f_i w_i$. We estimate it as $f\hat w$ denoting $\sum_{i \in I} f_i \hat w_i = \sum_{i \in S} f_i \hat w_i$. Note that for this estimate, we only
need to know $f_i$ for $i \in S$. To emulate a standard subset $J$, we let $f$ be the characteristic function of $J$,
that is, $f_i = 1$ if $i \in J$; otherwise $f_i = 0$. Using inner products will simplify a lot of notation in
our analysis. The generalization to fractional subsets comes for free in our analysis which is all based on
concentration bounds for sums of random variables $X_i \in [0, 1]$.

Notation for smaller weights Whenever threshold or priority sampling leaves us with a threshold
$t$, we know that variability in the estimate is from items $i$ with weight below $t$. We will generally use a
subscript $<t$ to denote the restriction to items $i$ with weights $w_i < t$, e.g., $I_{<t} = \{i \in I \mid w_i < t\}$ and
$w_{<t} = (w_i)_{i \in I_{<t}}$ is the vector of these smaller weights. Then $fw_{<t} = \sum_{i \in I_{<t}} f_i w_i$. Notice that $fw_{<t}$ does
not include $i$ with $w_i \ge t$ even if $f_i w_i < t$.

Above we defined $w_{<t}$ to denote the vector $(w_i)_{i \in I_{<t}}$ of weights below $t$, and used it for the inner
product $fw_{<t} = \sum_{i \in I_{<t}} f_i w_i$. When it is clear from the context that we need a number, not a vector, we
will use $w_{<t}$ to denote the sum of these weights, that is, $w_{<t} = \sum_{i \in I_{<t}} w_i = \mathbf{1} w_{<t}$ where $\mathbf{1}$ is the all 1s
vector. Since $f_i \le 1$ for all $i$, we always have $fw_{<t} \le w_{<t}$.

We shall use subscript $\le t$, $\ge t$, and $> t$ to denote the corresponding restriction to items with weight $\le t$,
$\ge t$, and $> t$, respectively.

For a subset $J$ of $I$, we let $fJ = \sum_{i \in J} f_i$, thus identifying $J$ with its characteristic vector. As an
example, we can write our estimate with threshold $t$ as

$$f\hat w^t = fw_{\ge t} + t\,(fI^t_{<t}). \tag{11}$$
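
The decomposition in (11) can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, and the hash values are assumed to lie in $(0, 1)$ so that every item with weight at least $t$ is sampled.

```python
def inner_product_estimate(f, weights, hashes, t):
    """Direct estimate: sum of f_i * max(w_i, t) over sampled items (q_i > t)."""
    return sum(fi * max(w, t)
               for fi, w, h in zip(f, weights, hashes) if w / h > t)


def decomposed_estimate(f, weights, hashes, t):
    """Right-hand side of (11): exact part from weights >= t, plus
    t times the total fraction of sampled items with weight < t."""
    exact = sum(fi * w for fi, w in zip(f, weights) if w >= t)
    sampled_small = sum(fi for fi, w, h in zip(f, weights, hashes)
                        if w < t and w / h > t)
    return exact + t * sampled_small
```

Both functions return the same value for any input with hashes in $(0, 1)$, which is the content of (11): only the term `t * sampled_small` is random.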

Error probability functions As in the unweighted case, the point in relating to threshold sampling is that 
error probability bounds for threshold sampling are easily derived. For items with unit weights, we reduced 
the bottom-$k$ sampling error event to the union of four threshold sampling error events (A), (B), (A'), and
(B'). However, now with weighted items, we are going to reduce the priority sampling error event to the 
union of an unbounded number of threshold sampling error events that happen with geometrically decreasing 
probabilities. Our reduction will hold for most hash functions, including 2-independent hash functions, but 
to make such a claim clear, we have to carefully describe what properties of the hash functions we rely on. 

Assume that the threshold $t$ is fixed. With reference to (11), the variability in our estimate is all from
$fI^t_{<t}$. An item $i \in I_{<t}$ is sampled and included in $I^t_{<t}$ if $q_i = w_i/h_i > t \iff h_i < w_i/t$, hence with
probability $w_i/t$. If $i$ is sampled, it adds $f_i \in [0, 1]$ to $fI^t_{<t}$; otherwise 0.

The above is an instance of bounded random variables $X_i \in [0, 1]$, $i \in I_{<t}$, where each $X_i$ is a function
of $h_i$, namely $X_i = f_i[h_i < w_i/t]$. With the fixed threshold $t$, $X_i$ depends only on the hash $h_i$. Therefore,
if the hash values $h_i$ are, say, $d$-independent, then so are the $X_i$. Let $X = \sum_{i \in I_{<t}} X_i$ and $\mu = E[X]$. We are
interested in an error probability function $p$ such that for $\mu > 0$, $\delta > 0$, if $\mu = E[X]$, then

$$\Pr[|X - \mu| > \delta\mu] \le p(\mu, \delta).$$

The error probability function $p$ that we can use depends on the quality of the hash function. For example,
if the hash function is 2-independent, then $\mathrm{Var}[X] \le \mu$, and then by Chebyshev's inequality, we can use

$$p^{\mathrm{Chebyshev}}(\mu, \delta) = 1/(\delta^2\mu). \tag{12}$$

In the case of full randomness, for $\delta \le 1$, we could use a standard 2-sided Chernoff bound (see, e.g., [25])

$$p^{\mathrm{Chernoff}}_{\delta \le 1}(\mu, \delta) = 2\exp(-\delta^2\mu/3). \tag{13}$$

For most of our results, it is more natural to think of $\delta$ as a function of $\mu$ and some target error probability
$P \in (0, 1)$, defining $\delta(\mu, P)$ such that

$$p(\mu, \delta(\mu, P)) = P. \tag{14}$$
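
For the two example bounds, the inverse $\delta(\mu, P)$ from (14) has a closed form. The Python sketch below is an illustration with hypothetical function names; the Chernoff inverse is only meaningful when the returned $\delta$ is at most 1, matching the restriction on (13).

```python
import math


def p_chebyshev(mu, delta):
    """Chebyshev bound (12) for 2-independent hashing."""
    return 1.0 / (delta * delta * mu)


def delta_chebyshev(mu, P):
    """Solve p_chebyshev(mu, delta) = P for delta."""
    return math.sqrt(1.0 / (P * mu))


def p_chernoff(mu, delta):
    """Two-sided Chernoff bound (13); intended for delta <= 1."""
    return 2.0 * math.exp(-delta * delta * mu / 3.0)


def delta_chernoff(mu, P):
    """Solve p_chernoff(mu, delta) = P for delta (caller checks delta <= 1)."""
    return math.sqrt(3.0 * math.log(2.0 / P) / mu)
```

For instance, with $\mu = 100$ and $P = 0.05$, the Chebyshev bound gives $\delta \approx 0.447$ while the Chernoff bound gives $\delta \approx 0.333$.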

Returning to threshold sampling with threshold $t$, for $i \in I_{<t}$, we have $X_i = f_i[i \in I^t_{<t}]$, $X = fI^t_{<t}$, and
$\mu = fw_{<t}/t$. Moreover, by (11), $f\hat w^t - fw = t(fI^t_{<t}) - fw_{<t}$. Hence

$$\Pr[|f\hat w^t - fw| > \delta(fw_{<t}/t, P)\, fw_{<t}] \le P. \tag{15}$$

When we start analyzing priority sampling, we will need to relate the probabilities of different threshold 
sampling events. This places some constraints on the error probability function p. Mathematically, it is 
convenient to allow p to attain values above 1, but only values below 1 are probabilistically interesting. 

Definition 4 An error probability function $p : \mathbb{R}_{>0} \times \mathbb{R}_{>0} \to \mathbb{R}_{>0}$ is well-behaved if

(a) $p$ is continuous and strictly decreasing in both arguments.

(b) If with the same absolute error we decrease the expectancy, then the probability goes down. Formally,
if $\mu' < \mu$ and $\mu'\delta' \ge \mu\delta$, then $p(\mu', \delta') < p(\mu, \delta)$.

We also have an optional condition for cases where we only care for $\delta \le 1$ as in (13):

(c) If $\delta \le 1$ and $p(\mu, \delta) \le P_p$ for some constant $P_p$ depending on $p$, then $p(\mu, \delta)$ falls at least
inversely proportional to $\mu\delta^2$. Formally, if $\delta_0, \delta_1 \le 1$, $p(\mu_0, \delta_0) \le P_p$, and $\mu_0\delta_0^2 < \mu_1\delta_1^2$, then

$$p(\mu_0, \delta_0) \ge \frac{\mu_1\delta_1^2}{\mu_0\delta_0^2}\, p(\mu_1, \delta_1). \tag{16}$$

We will use condition (c) to argue that probabilities of different events fall geometrically. The condition
is trivially satisfied with our 2-independent Chebyshev bound (12), so we can just set $P_{p^{\mathrm{Chebyshev}}} = 1$. The
restrictions in (c) are necessary for the Chernoff bound (13), $p^{\mathrm{Chernoff}}_{\delta \le 1}(\mu, \delta) = 2\exp(-\delta^2\mu/3)$, which only falls
quickly enough for $\delta^2\mu/3 \ge 1$, hence with $p^{\mathrm{Chernoff}}_{\delta \le 1}(\mu, \delta) \le P_{p^{\mathrm{Chernoff}}} = 2/e$. As a further illustration,

with 4-independence, we have the 4th moment bound ^^jl ( see > e -g-> ll2Tl Lemma 4.19]). For p, > 1, 
this is upper bounded by p 4th " moment M>i $) = (jjjp) ■ F° r 8 < 1, the condition fj, > 1 is satisfied if 

p 4th moment^ ^ < ^ SQ w£ Qw ^ us£ pW _ = L 

As an application of (a) and (b) we get 

Lemma 5 For thresholds $t'$, $t$, and relative errors $\delta'$, $\delta$, if $t' < t$ and $p(fw_{\le t'}/t', \delta') = p(fw_{\le t}/t, \delta)$,
then

$$\delta' fw_{\le t'} < \delta\, fw_{\le t}.$$

Hence, for any fixed target error probability $P$ in (15), the target error

$$\delta(fw_{\le t}/t, P)\, fw_{\le t}$$

is strictly increasing in the threshold $t$, that is, it decreases whenever the threshold decreases.

Proof We will divide the decrease from $t$ to $t'$ into a series of atomic decreases. The first atomic "decrease"
is from $fw_{\le t}$ to $fw_{<t}$. This makes no difference unless there are weights equal to $t$ so that $fw_{<t} < fw_{\le t}$.
Assume this is the case and suppose $p(fw_{<t}/t, \delta_{<t}) = p(fw_{\le t}/t, \delta_{\le t})$. Since $fw_{<t}/t < fw_{\le t}/t$, it
follows directly from (b) that $\delta_{<t} fw_{<t}/t < \delta_{\le t} fw_{\le t}/t$, hence that $\delta_{<t} fw_{<t} < \delta_{\le t} fw_{\le t}$.

The other atomic decrease we consider is from $fw_{<t}$ to $fw_{\le t'}$ where $t' < t$ and with no weights
in $(t', t)$, hence with $fw_{\le t'} = fw_{<t}$. Suppose $p(fw_{\le t'}/t', \delta_{\le t'}) = p(fw_{<t}/t, \delta_{<t})$. Since $t' < t$,
$fw_{\le t'}/t' > fw_{<t}/t$, so by (a), $\delta_{\le t'} < \delta_{<t}$. It follows that $\delta_{\le t'} fw_{\le t'} < \delta_{<t} fw_{<t}$. Alternating between
these two atomic decreases, we can implement an arbitrary decrease in the threshold as required for the
lemma. ■



Threshold confidence intervals In the case of threshold sampling with a fixed threshold $t$, it is now trivial
to derive confidence intervals for the true value $fw$. The sample gives us the exact value $fw_{\ge t}$ for weights
at least as big as $t$, and an estimate $f\hat w^t_{<t}$ for those below. Setting

$$fw^-_{<t} = \min\{x \mid (1 + \delta(x/t, P))\,x \ge f\hat w^t_{<t}\}$$
$$fw^+_{<t} = \max\{x \mid (1 - \delta(x/t, P))\,x \le f\hat w^t_{<t}\}$$

we get

$$\Pr[fw_{\ge t} + fw^-_{<t} \le fw \le fw_{\ge t} + fw^+_{<t}] \ge 1 - 2P.$$
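
As an illustration, the interval endpoints can be written in closed form when $p$ is the Chebyshev bound (12), so that $\delta(x/t, P) = \sqrt{t/(Px)}$. This choice of $p$ and the function name below are assumptions made for the sketch; with another error probability function the two endpoints would typically be found by a numeric search. Substituting $y = \sqrt{x}$ and $c = \sqrt{t/P}$ turns the two boundary conditions into the quadratics $y^2 + cy = e$ and $y^2 - cy = e$, where $e$ is the sampled estimate for weights below $t$.

```python
import math


def chebyshev_threshold_interval(est_small, exact_large, t, P):
    """Confidence interval for fw under threshold sampling with threshold t,
    assuming the Chebyshev bound p(mu, delta) = 1 / (delta^2 * mu).

    est_small:   the estimate for items with weight below t
    exact_large: the exact contribution of items with weight at least t
    """
    c = math.sqrt(t / P)
    root = math.sqrt(c * c + 4.0 * est_small)
    lower_small = ((root - c) / 2.0) ** 2  # x + sqrt(t*x/P) = est_small
    upper_small = ((root + c) / 2.0) ** 2  # x - sqrt(t*x/P) = est_small
    return exact_large + lower_small, exact_large + upper_small
```

By the displayed bound, the true value lies in the returned interval with probability at least $1 - 2P$ under these assumptions.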

3.2 Priority sampling: the main result 

We are now ready to present our main technical result. 

Theorem 6 For items $i \in I$, let be given a weight vector $(w_i)_{i \in I}$ with corresponding fractions $(f_i)_{i \in I}$.
Let the error probability function $p$ satisfy Definition 4 including (c). With a given target error probability
$P \le P_p$ and sample size $k$, consider a random priority sampling event, assigning to each item $i \in I$ a
hash $h_i \in (0, 1)$ and priority $q_i = w_i/h_i$. Let $\tau$ be the resulting priority threshold, i.e., the $(k+1)$th largest
priority. Let

$$\delta = 6\,\delta(fw_{<\tau}/(3\tau), P).$$

If $\delta \le 2$, then

$$\Pr[|f\hat w^\tau - fw| > \delta\, fw_{<\tau}] = O(P).$$

The above constants are not optimized, but with pure O-notation, it is not as easy to make a formally clear
statement. Ignoring the constants and the restriction $\delta \le 2$, we see that our error bound for priority sampling
with threshold $\tau$ is of the same type as the one in (15) for threshold sampling with fixed threshold $t = \tau$. In
our case, the priority threshold $\tau$ is variable, and from Lemma 5 it follows that our probabilistic error bound

$$6\,\delta(fw_{<\tau}/(3\tau), P)\, fw_{<\tau}$$

decreases with the priority threshold $\tau$.

The proof of Theorem 6 is rather convoluted. With some target error probability $P$, we will identify
$t_{\min}$, $t_{\max}$ such that

(i) With probability $1 - O(P)$, the priority threshold $\tau \in [t_{\min}, t_{\max}]$.

(ii) With probability $1 - O(P)$, with a single random choice of the $h_i$ but simultaneously for all $t \in
[t_{\min}, t_{\max}]$, if $\delta = 6\,\delta(fw_{<t}/(3t), P) \le 2$, then $|f\hat w^t - fw| \le \delta\, fw_{<t}$.

A union bound on $\neg$(i) $\vee$ $\neg$(ii) implies Theorem 6. What makes (ii) very tricky to prove is that
$\delta(fw_{<t}/(3t), P)$ can vary a lot for different $t \in [t_{\min}, t_{\max}]$.

Priority confidence intervals The format of Theorem 6 makes it very easy to derive confidence intervals
like those for threshold sampling. A priority sample with priority threshold $\tau$ gives us the exact value $fw_{\ge\tau}$
for weights at least as big as $\tau$, and an estimate $f\hat w^\tau_{<\tau}$ for those below. For an upper bound on $fw_{<\tau}$, we
compute

$$fw^+_{<\tau} = \max\{x \mid \delta = 6\,\delta(x/(3\tau), P) \wedge (1 - \delta)x \le f\hat w^\tau_{<\tau}\}.$$

Note that here in the upper bound, we only consider $\delta \le 1$, so we do not need to worry about the restriction
$\delta \le 2$ in Theorem 6. For the lower bound, we use

$$fw^-_{<\tau} = \min\{x \mid \delta = 6\,\delta(x/(3\tau), P) \le 2 \wedge (1 + \delta)x \ge f\hat w^\tau_{<\tau}\}.$$

Here in the lower bound, the restriction $\delta = 6\,\delta(x/(3\tau), P) \le 2$ prevents us from deriving a lower bound
$x = fw^-_{<\tau} < f\hat w^\tau_{<\tau}/3$. In such cases, we use the trivial lower bound $x = fw^-_{<\tau} = 0$, which in distance
from $f\hat w^\tau_{<\tau}$ is at most 1.5 times bigger. Now, by Theorem 6,

$$\Pr[fw_{\ge\tau} + fw^-_{<\tau} \le fw \le fw_{\ge\tau} + fw^+_{<\tau}] \ge 1 - O(P).$$

In cases where the exact part $fw_{\ge\tau}$ of an estimate is small compared with the variable part $f\hat w^\tau_{<\tau}$, we may
be interested in a non-zero lower bound $fw^-_{<\tau}$ even if it is smaller than $f\hat w^\tau_{<\tau}/3$. To do this, we need bounds
for larger $\delta$.

Large errors We are now going to present bounds that work for arbitrarily large relative errors $\delta$. We
assume a basic error probability function $p$ satisfying Definition 4 (a) and (b) while (c) may not be satisfied.

The bounds we get are not as clean as those from Theorem 6, but we include them to show that something
can be done also for $\delta > 2$. In particular, this means that we only worry about positive errors.

Theorem 7 For items $i \in I$, let be given a weight vector $(w_i)_{i \in I}$ with corresponding fractions $(f_i)_{i \in I}$, a
target error probability $P$, and a sample size $k$. Based on $(w_i)_{i \in I}$, let $t_{\max}$ be the smallest upper bound on a
random priority threshold that is exceeded with probability at most $P$, that is, the probability of generating
at least $k + 1$ priorities above $t_{\max}$ is at most $P$. Consider a random priority sampling event. Set

$$\ell = 1 + \log_2(t_{\max}/\tau)$$
$$\delta = \delta(fw_{<\tau}/\tau, P/\ell^2).$$

Then

$$\Pr[f\hat w^\tau_{<\tau} > (2 + 2\delta)\, fw_{<\tau}] = O(P).$$

Complementing Theorem 6, we only intend to use Theorem 7 for large errors where $(1 + 2\delta) = O(\delta)$.
We wish to provide a probabilistic lower bound for $fw_{<\tau}$. Unfortunately, we do not know $t_{\max}$ which
depends on the whole weight vector $(w_i)_{i \in I}$. However, based on our priority sample, it is not hard to generate
a probabilistic upper bound $\bar t_{\max}$ on $t_{\max}$ such that $\Pr[\bar t_{\max} < t_{\max}] \le P$. We then compute $\ell = 1 +
\log_2(\bar t_{\max}/\tau)$ and set

$$fw^-_{<\tau} = \min\{x \mid \delta = \delta(x/\tau, P/\ell^2) \wedge 2(1 + \delta)x \ge f\hat w^\tau_{<\tau}\}.$$

Then by Theorem 7,

$$\Pr[fw_{<\tau} \ge fw^-_{<\tau}] = 1 - O(P).$$

To see this, let $fw^*_{<\tau}$ be the value we would have obtained if we had computed $fw^-_{<\tau}$ using the real $t_{\max}$.
Our error event is that $\bar t_{\max} < t_{\max}$ or $fw_{<\tau} < fw^*_{<\tau}$. The former happens with probability at most
$P$, and Theorem 7 states that the latter happens with probability $O(P)$. Hence none of these error events
happen with probability $1 - O(P)$, but then $\bar t_{\max} \ge t_{\max}$, implying $fw^-_{<\tau} \le fw^*_{<\tau} \le fw_{<\tau}$.

Theorem 6, our main result, is proved in Sections 3.3-3.5. Theorem 7, which is much easier, is proved
in Section 3.6. In Section 3.7 we show how to compute the upper bound on $t_{\max}$ used for the confidence lower
bound with Theorem 7.

3.3 The priority threshold 

To prove Theorem 6 we need a handle on the variable priority threshold, identifying the previously mentioned
probabilistic lower and upper bounds, $t_{\min}$ and $t_{\max}$.

With priority sampling we specify the number $k$ of samples, and use as threshold $\tau$ the $(k+1)$th priority.
Recall for any threshold $t$ that the subscript $<t$ indicates restriction to items with weight below $t$. To relate
this notation to a priority sample of some specified size $k$, we let $k_{\le t}$ denote $k$ minus the number of items
with weight bigger than $t$. We define $k_{<t}$ accordingly. With threshold $t$, the expected number of samples is
$k_t = k - k_{\le t} + w_{\le t}/t = k - k_{<t} + w_{<t}/t$. The last equality is because weights $w_i = t$ cancel out.

The ideal threshold We define the ideal threshold $t^*$ to be the one leading to an expected number of
exactly $k$ samples, that is, $w_{\le t^*} = t^* k_{\le t^*}$.

Lemma 8 $t k_{\le t} - w_{\le t}$ is strictly increasing in $t$, so (a) $t^*$ is uniquely defined, (b) $w_{\le t} > k_{\le t}\, t$ for any $t < t^*$,
and (c) $w_{\le t} < k_{\le t}\, t$ for any $t > t^*$.

Proof If $t$ increases without passing any weight value, then $k_{\le t}$ and $w_{\le t}$ are not changed, and the statement
is trivial. When $t$ reaches the value of some weight $w_i$, then both $t k_{\le t}$ and $w_{\le t}$ are increased by the same
value $w_i$ (if there are $j$ weights with value $w_i$, the increase is by $j w_i$). ■
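
Since the expected number of samples with threshold $t$ is $\sum_i \min\{1, w_i/t\}$, the ideal threshold can be computed directly from the sorted weights. The Python sketch below is illustrative: the function names are hypothetical, and it assumes positive weights and at least $k$ items. It guesses the number $m$ of items with weight above $t^*$ and then solves for the threshold that makes the remaining items contribute $k - m$ expected samples.

```python
def expected_samples(weights, t):
    """Expected sample count with threshold t: item i is sampled w.p. min(1, w_i/t)."""
    return sum(min(1.0, w / t) for w in weights)


def ideal_threshold(weights, k):
    """Return t* with expected_samples(weights, t*) == k."""
    ws = sorted(weights, reverse=True)
    if k < 1 or k > len(ws):
        raise ValueError("need 1 <= k <= number of items")
    for m in range(k):
        # Suppose the m largest weights are above t and all others are at most t.
        t = sum(ws[m:]) / (k - m)
        if ws[m] <= t:
            return t
    raise AssertionError("unreachable for positive weights")
```

For example, with weights `[10, 1, 1, 1, 1]` and `k = 2`, the large item is always sampled and the four unit weights share one expected sample, so $t^* = 4$.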

We would like to claim that the priority sampling threshold $\tau$ is concentrated around $t^*$, but this may be far
from true. To illustrate what makes things tricky to analyze, consider the case where, say, we have $k - 1$
weights of size $t^*$, and then a lot of small weights that sum to $t^*$. In this case we get an upper bound on $\tau$
which is close to $t^*$, but we do not get any good lower bound on $\tau$ even if we have full randomness. On the
other hand, in this case, it is only little weight that is affected by the downwards variance in $\tau$.

Threshold upper bound We now present a certificate that we will use to bound the probability that the
priority threshold exceeds a given threshold $T$. The certificate uses an extra parameter $t < T$ which is a
concrete weight value. In order to get a priority threshold as big as $T$, we need $|I^T| \ge k + 1$, and this
requires $|I^T_{\le t}| \ge k_{\le t} + 1$. The expected value of $|I^T_{\le t}|$ is $\mu^T_{\le t} = E[|I^T_{\le t}|] = w_{\le t}/T$. We define $\delta^{\uparrow}_{\le t}$ such that

$$(1 + \delta^{\uparrow}_{\le t})\,\mu^T_{\le t} = k_{\le t} + 1.$$

The error event is thus equivalent to $|I^T_{\le t}| \ge (1 + \delta^{\uparrow}_{\le t})\,\mu^T_{\le t}$. The probability of the upper bound error event
is bounded by $p(\mu^T_{\le t}, \delta^{\uparrow}_{\le t})$. We say that $t$ and $T$ witness that $T$ is a threshold upper bound with error
probability $P$ if

$$p^{\uparrow}(t, T) = p(\mu^T_{\le t}, \delta^{\uparrow}_{\le t}) \le P. \tag{17}$$

It may seem wasteful that we make the conservative assumption that all weights between $t$ and $T$ get priorities
at least $T$, but this is because we only want to bound probabilities using the simple function $p(\mu, \delta)$.
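
To make the certificate concrete, the following Python sketch evaluates the witness probability in (17) for a given pair $t < T$. It is an illustration under assumptions: the function name is hypothetical, $p$ is taken to be the Chebyshev bound (12), and the count of weights "bigger than $t$" follows the definition of $k_{\le t}$ from the start of this subsection.

```python
import math


def upper_witness_probability(weights, k, t, T):
    """Chebyshev instance of the bound in (17) on Pr[priority threshold >= T]."""
    k_le = k - sum(1 for w in weights if w > t)   # k minus #weights bigger than t
    mu = sum(w for w in weights if w <= t) / T    # expected #samples among weights <= t
    if mu <= 0.0:
        return 0.0 if k_le + 1 > 0 else math.inf  # no small items can be sampled
    delta = (k_le + 1) / mu - 1.0                 # (1 + delta) * mu = k_le + 1
    if delta <= 0.0:
        return math.inf                           # the bound says nothing here
    return 1.0 / (delta * delta * mu)
```

The pair $(t, T)$ witnesses $T$ as a threshold upper bound with error probability $P$ when the returned value is at most $P$. Values above 1 are allowed, in line with the convention that $p$ may exceed 1.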

Threshold lower bound We also need to introduce a lower bound $t_{\min}$ for the priority threshold such that
the probability that it is violated is bounded by $P$. In our later analysis of error probability bounds, it will
be crucial that we only need to consider thresholds $t > t_{\min}$.

The basic condition we use here is that the priority threshold $\tau$ is smaller than or equal to $T$ if and only
if $|I^T_{\le T}| \le k_{\le T}$. The expected value of $|I^T_{\le T}|$ is $\mu_{\le T} = E[|I^T_{\le T}|] = w_{\le T}/T$. We define $\delta^{\downarrow}_{\le T}$ such that

$$(1 - \delta^{\downarrow}_{\le T})\,\mu_{\le T} = k_{\le T}.$$

The error event $\tau \le T$ is thus equivalent to $|I^T_{\le T}| \le (1 - \delta^{\downarrow}_{\le T})\,\mu_{\le T}$. The probability of the lower bound
error event is bounded by $p(\mu_{\le T}, \delta^{\downarrow}_{\le T})$. We say that $T$ witnesses that $T$ is a threshold lower bound with error
probability $P$ if

$$p^{\downarrow}(T) = p(\mu_{\le T}, \delta^{\downarrow}_{\le T}) \le P. \tag{18}$$

Below $t_{\min}$ is defined as the largest value witnessed this way. In the analysis below it will be crucial that
$t_{\max}$ is no bigger than any upper bound witness and that $t_{\min}$ is no smaller than any lower bound witness.

By definition, $\Pr[\tau \in (t_{\min}, t_{\max}]] \ge 1 - 2P$. To prove Theorem 6 it therefore suffices to show that the
following good non-error event happens with probability $1 - O(P)$:

$$\forall t \in (t_{\min}, t_{\max}] : \delta = \delta(fw_{<t}/(3t), P) \le 1/3 \implies |f\hat w^t - fw| \le 6\delta\, fw_{<t} \tag{19}$$

Note that our variable $\delta$ is 6 times smaller than the one in the statement of Theorem 6. This turns out to be
a more convenient choice for the analysis.

3.4 Tightening the gap 

The following lemmas give us a much tighter understanding of $t \in (t_{\min}, t_{\max}]$.

Lemma 9 If $t > t^*$ then $t(1 + \delta(w_{\le t}/t_{\max}, P)) \ge t_{\max}$.

Proof Let $\delta = \delta(w_{\le t}/t_{\max}, P)$ and $T = (1 + \delta)t$. Assume for a contradiction that $T < t_{\max}$. We will
prove that $p^{\uparrow}(t, T) \le P$, hence that $T$ is an upper bound, contradicting the minimality of $t_{\max}$.
By Lemma 8 (c), since $t > t^*$, we have $w_{\le t} < t k_{\le t}$, so

$$k_{\le t} > w_{\le t}/t = (1 + \delta)\, w_{\le t}/T.$$

For the priority threshold to be as big as $T$ we need $|I^T_{\le t}| \ge k_{\le t} + 1$, and we defined $\delta^{\uparrow}_{\le t}$ such that
$(1 + \delta^{\uparrow}_{\le t})\, w_{\le t}/T = k_{\le t} + 1$. It follows that $\delta^{\uparrow}_{\le t} > \delta$. Moreover, since $T < t_{\max}$, we have $w_{\le t}/T > w_{\le t}/t_{\max}$,
so we conclude

$$p^{\uparrow}(t, T) = p(w_{\le t}/T, \delta^{\uparrow}_{\le t}) < p(w_{\le t}/t_{\max}, \delta) \le P.$$

It follows that $T$ is an upper bound contradicting the minimality of $t_{\max}$. ■



Lemma 10 For $t \in [t_{\min}, t^*]$ we have $t \ge (1 - \delta(w_{\le t}/t, P))\,t^*$.

Proof The proof is by contradiction against our maximal lower bound $t_{\min}$. Recall that the priority threshold
is at most $T$ if $|I^T_{\le T}| \le k_{\le T}$. The expectancy is $E[|I^T_{\le T}|] = w_{\le T}/T$, so for lower bounds, we defined
$\delta^{\downarrow}_{\le T}$ so that $(1 - \delta^{\downarrow}_{\le T})\, w_{\le T}/T = k_{\le T}$, getting $T$ as a lower bound if $p^{\downarrow}(T) = p(w_{\le T}/T, \delta^{\downarrow}_{\le T}) \le P$.
Let $\delta = \delta(w_{\le t}/t, P)$ and suppose for a contradiction that $(1 - \delta)\, w_{\le t}/t > k_{\le t}$. Then, by definition, $\delta < \delta^{\downarrow}_{\le t}$. It follows that

$$p^{\downarrow}(t) = p(w_{\le t}/t, \delta^{\downarrow}_{\le t}) < p(w_{\le t}/t, \delta) \le P,$$

implying that $t \ge t_{\min}$ is a lower bound. If $t > t_{\min}$ this contradicts the maximality of $t_{\min}$. Otherwise,
we pick an infinitesimally larger $t' > t$ with no weights in $(t, t']$, that is, $w_{\le t'} = w_{\le t}$. This yields an
infinitesimally larger $p^{\downarrow}(t')$, that is, $p^{\downarrow}(t) < p^{\downarrow}(t') < P$. Now $t' > t_{\min}$ is a lower bound contradicting the
maximality of $t_{\min}$. Thus we conclude that $(1 - \delta)\, w_{\le t}/t \le k_{\le t}$, or equivalently, $t \ge (1 - \delta)\, w_{\le t}/k_{\le t}$.
Finally by Lemma 8 (b), we have $w_{\le t}/k_{\le t} \ge t^*$. This completes the proof that $t \ge (1 - \delta)t^*$. ■

In order to give a joint analysis for $t$ bigger and smaller than $t^*$, we make a conservative combination of
Lemmas 9 and 10.

Lemma 11 Suppose $(t^-, t^+] = [t_{\min}, t^*]$ or $(t^-, t^+] = (t^*, t_{\max}]$. If $t \in (t^-, t^+]$ then $t \ge (1 -
\delta(w_{\le t}/t^+, P))\,t^+$.

Proof Let $\delta = \delta(w_{\le t}/t^+, P)$. If $(t^-, t^+] = (t^*, t_{\max}]$, by Lemma 9, $t \ge t^+/(1 + \delta) > (1 - \delta)t^+$. If
$(t^-, t^+] = [t_{\min}, t^*]$, by Lemma 10, $t \ge (1 - \delta(w_{\le t}/t, P))\,t^+ \ge (1 - \delta)t^+$. The last inequality uses that
$\delta(w_{\le t}/t, P) \le \delta$. This is because $w_{\le t}/t \ge w_{\le t}/t^+$, so by Definition 4 (a), $p(w_{\le t}/t, \delta) \le p(w_{\le t}/t^+, \delta) =
P$. ■

Losing a factor 2 in the error probability to cover $t$ bigger or smaller than $t^*$, our good event (19) reduces
to

$$\forall t \in (t^-, t^+] : \delta = \delta(fw_{<t}/(3t), P) \le 1/3 \implies |f\hat w^t - fw| \le 6\delta\, fw_{<t} \tag{20}$$

By Lemma 11, the bound on $\delta$ implies that $t \ge (2/3)t^+$. Therefore (20) is implied by

$$\forall t \in (t^-, t^+] : \delta = \delta(fw_{<t}/(2t^+), P) \le 1/3 \implies |f\hat w^t - fw| \le 6\delta\, fw_{<t} \tag{21}$$

One advantage of dealing with $fw_{<t}/t^+$ instead of $fw_{<t}/t$ is that $fw_{<t}/t^+$ is proportional to $fw_{<t}$ hence
monotone in $t$ whereas $fw_{<t}/t$ is not monotone in $t$. In particular, if there is a $t \in (t^-, t^+]$ such that
$\delta(fw_{<t}/(2t^+), P) > 1/3$, then there is a $t'^-$ such that $\delta(fw_{<t}/(2t^+), P) \le 1/3$ if and only if $t > t'^-$.
Otherwise we set $t'^- = t^-$. Then (21) is equivalent to

$$\forall t \in (t'^-, t^+] : \delta = \delta(fw_{<t}/(2t^+), P) \implies |f\hat w^t - fw| \le 6\delta\, fw_{<t} \tag{22}$$

and for $t \in (t'^-, t^+]$, we know that $\delta(fw_{<t}/(2t^+), P) \le 1/3$.

3.5 Dividing into layers

We now define a sequence $t_0 > t_1 > \cdots > t_{L+1}$ of decreasing thresholds with $t_0 = t^+$ and $t_{L+1} = t'^-$.
For $\ell = 0, \ldots, L$ we require

$$fw_{\le t_{\ell+1}} \ge fw_{<t_\ell}/2. \tag{23}$$

For $\ell < L$, we pick $t_{\ell+1}$ smallest possible satisfying (23), so

$$fw_{<t_{\ell+1}} < fw_{<t_\ell}/2. \tag{24}$$

We arrive at $\ell = L$ when $fw_{\le t'^-} \ge fw_{<t_\ell}/2$. Then we set $t_{L+1} = t'^-$.
For each "layer" $\ell \le L$, we define

$$\delta_\ell = \delta(fw_{<t_\ell}/(2t^+), P), \tag{25}$$

noting that this is the same value as we would use for $t = t_\ell$ in (22). By definition, for all $\ell \le L$, we
have $p(fw_{<t_\ell}/(2t^+), \delta_\ell) = P$. Since $t_\ell > t'^-$, we have $\delta_\ell \le 1/3 < 1$, so it follows from Definition 4
(c) that there is a constant $C$ such that $fw_{<t_\ell}\,\delta_\ell^2 = C$ for all $\ell \le L$. For $\ell = 1, \ldots, L$, by (24), we have
$fw_{<t_\ell} < fw_{<t_{\ell-1}}/2$. Therefore

$$\delta_\ell > \sqrt{2}\,\delta_{\ell-1} \tag{26}$$
$$\delta_\ell\, fw_{<t_\ell} < \delta_{\ell-1}\, fw_{<t_{\ell-1}}/\sqrt{2} \tag{27}$$

This will correspond to an effect where the relative errors are geometrically increasing while the absolute
errors are geometrically decreasing. Another important thing to notice is that by (23), $fw_{\le t_{\ell+1}} \ge fw_{<t_\ell}/2$,
so $\delta(fw_{\le t_{\ell+1}}/t^+, P) \le \delta(fw_{<t_\ell}/(2t^+), P) = \delta_\ell$. Therefore, by Lemma 11,

$$t_{\ell+1} \ge (1 - \delta_\ell)\,t^+. \tag{28}$$
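
The layer construction can be sketched as follows. This Python code is only an illustration with a hypothetical function name. It assumes one particular reading of the conditions: $t_{\ell+1}$ is the smallest threshold such that the fractional weight at or below $t_{\ell+1}$ is at least half of the fractional weight strictly below $t_\ell$, and the construction stops once the lower end $t'^-$ itself satisfies that condition.

```python
def layer_thresholds(f, weights, t_plus, t_minus):
    """Return [t_0, t_1, ..., t_{L+1}] with t_0 = t_plus and t_{L+1} = t_minus."""
    items = sorted(zip(weights, f))  # (w_i, f_i) by increasing weight

    def fw_below(t, inclusive):
        return sum(fi * w for w, fi in items
                   if (w <= t if inclusive else w < t))

    ts = [t_plus]
    while True:
        half = fw_below(ts[-1], inclusive=False) / 2.0
        if fw_below(t_minus, inclusive=True) >= half:
            ts.append(t_minus)       # last layer reached
            return ts
        cumulative = 0.0
        for w, fi in items:
            if w >= ts[-1]:
                break
            cumulative += fi * w
            if cumulative >= half:   # smallest weight value reaching half
                ts.append(w)
                break
```

With unit fractions and weights 1, 2, 4, 8, each layer holds a single weight, which mirrors the halving behind (26) and (27).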

Good layers For each layer $\ell < L$, our good event will be that for weights in $[t_{\ell+1}, t_\ell)$, the relative
estimate error is bounded by $2\delta_\ell$. Formally

$$\forall t \in (t_{\ell+1}, t^+] : |f\hat w^t_{[t_{\ell+1},t_\ell)} - fw_{[t_{\ell+1},t_\ell)}| \le 2\delta_\ell\, fw_{[t_{\ell+1},t_\ell)} \tag{29}$$

Above, the subscript $[t_{\ell+1}, t_\ell)$ denotes the restriction to items $i$ with weights $w_i \in [t_{\ell+1}, t_\ell)$. The last layer $L$
is special in that we want to consider all weights below $t_L$. Here the good event is that

$$\forall t \in (t_{L+1}, t^+] : |f\hat w^t_{<t_L} - fw_{<t_L}| \le 3\delta_L\, fw_{<t_L} \tag{30}$$

To prove Theorem 6 we are going to prove two statements.

• Assume that all layers are good satisfying (29) and (30). If for any $t \in (t'^-, t^+]$ we add up the errors
from all relevant layers, then the total error is bounded by $6\,\delta(fw_{<t}/(2t^+), P)\, fw_{<t}$, so (22) is satisfied.

• If $P_\ell$ is the probability that layer $\ell$ fails, then the $P_\ell$ are geometrically increasing and $P_L = O(P)$,
so by a union bound, the probability that any layer fails is $O(P)$.

Adding layer errors Assuming that all layers are good satisfying (29) and (30), we pick an arbitrary
threshold $t \in (t'^-, t^+]$. We wish to bound the estimate error $|f\hat w^t - fw|$.

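Let $h$ be the layer such that $t \in (t_{h+1}, t_h]$. We can only have estimate errors from weights below $t \le t_h$,
so

$$\begin{aligned}
|f\hat w^t - fw| &\le \sum_{\ell=h}^{L-1} |f\hat w^t_{[t_{\ell+1},t_\ell)} - fw_{[t_{\ell+1},t_\ell)}| + |f\hat w^t_{<t_L} - fw_{<t_L}| \\
&\le \sum_{\ell=h}^{L-1} 2\delta_\ell\, fw_{[t_{\ell+1},t_\ell)} + 3\delta_L\, fw_{<t_L} \\
&= \sum_{\ell=h}^{L-1} 2\delta_\ell\,(fw_{<t_\ell} - fw_{<t_{\ell+1}}) + 3\delta_L\, fw_{<t_L}.
\end{aligned}$$

By (26), the $\delta_\ell$ are increasing, so in the above sum, every $fw_{<t_\ell}$ appears with a positive coefficient. It follows
that we could only get a larger sum if (24) was more than tight with $fw_{<t_\ell} = fw_{<t_{\ell-1}}/2$ for $\ell \le L$. Then
we would have $fw_{<t_\ell} - fw_{<t_{\ell+1}} = fw_{<t_\ell}/2$ and corresponding to (27), $\delta_\ell\, fw_{<t_\ell} = \delta_{\ell-1}\, fw_{<t_{\ell-1}}/\sqrt{2}$.
Thus we get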

L-l 

1/5;* - fw\ <J2 2 W w <t e ~ fw <te+1 ) 
e=h 

+ 35 L fw <tL 

L-l 

5 h fw <th /V2 

t=h 

+ 35 h fw <th /V2 L 
< 35 h fw <th 

The last inequality exploits that X^iVv^ = 1/(1 — I/a/2) < 3. Since t < t^, we have 

$\delta(fw_{<t}/(2t^+), P) \ge \delta(fw_{<t_h}/(2t^+), P) = \delta_h$. Also, $t > t_{h+1}$, so by (23), $fw_{<t} \ge fw_{\le t_{h+1}} \ge fw_{<t_h}/2$,
so

$$\delta(fw_{<t}/(2t^+), P)\, fw_{<t} \ge \delta_h\, fw_{<t_h}/2.$$

Therefore

$$|f\hat w^t - fw| \le 3\delta_h\, fw_{<t_h} \le 6\,\delta(fw_{<t}/(2t^+), P)\, fw_{<t}.$$

Thus we conclude that (22) follows from (29) and (30).

Intermediate layer error probabilities We now consider the intermediate layers $\ell = 0, \ldots, L - 1$. We
want to show that the probability $P_\ell$ of violating (29) increases geometrically with $\ell$ yet remains bounded
by $P$. First we consider the upper bound part of (29)

$$\forall t \in (t_{\ell+1}, t^+] : f\hat w^t_{[t_{\ell+1},t_\ell)} \le (1 + 2\delta_\ell)\, fw_{[t_{\ell+1},t_\ell)}. \tag{31}$$

We claim that it can never be violated. The worst that can happen is that every item $i$ in the layer gets
sampled, and the estimate is at most $f_i t^+$. However, the items all have weight at least $t_{\ell+1}$ and by (28),
$t_{\ell+1} \ge (1 - \delta_\ell)t^+$. The increase is thus by at most a factor $t^+/t_{\ell+1} \le 1/(1 - \delta_\ell)$, which for $\delta_\ell \le 1/3$ is
less than $(1 + 2\delta_\ell)$. Thus (31) is satisfied regardless of the random choices.
We now consider the lower bound part of (29)

$$\forall t \in (t_{\ell+1}, t^+] : f\hat w^t_{[t_{\ell+1},t_\ell)} \ge (1 - 2\delta_\ell)\, fw_{[t_{\ell+1},t_\ell)}. \tag{32}$$

This event could happen. To bound the probability, we will focus on the loss $fw_{[t_{\ell+1},t_\ell)} - f\hat w^t_{[t_{\ell+1},t_\ell)}$. When
bounding the loss, we do not consider the gain from possible overestimates of sampled items. We only
consider the actual losses $f_i w_i$ from unsampled items $i$. In fact, we consider an item $i$ lost if $q_i \le t^+$. This
includes any item unsampled with some threshold $t \le t^+$. The loss for every threshold $t \le t^+$ is thus
bounded as

$$fw_{[t_{\ell+1},t_\ell)} - f\hat w^t_{[t_{\ell+1},t_\ell)} \le \sum_{i : w_i \in [t_{\ell+1}, t_\ell)} [q_i \le t^+]\, f_i w_i.$$

We know that $w_i \ge t_{\ell+1} \ge (1 - \delta_\ell)t^+$. Therefore

$$\Pr[q_i \le t^+] = \Pr[h_i \ge w_i/t^+] \le \Pr[h_i \ge t_{\ell+1}/t^+] = 1 - t_{\ell+1}/t^+ \le \delta_\ell.$$

The expected loss from layer $\ell$ is thus bounded by

$$\sum_{i : w_i \in [t_{\ell+1}, t_\ell)} \delta_\ell f_i w_i = \delta_\ell\, fw_{[t_{\ell+1},t_\ell)}.$$

For (32) to fail, we need a loss that is twice this big, that is, $2\delta_\ell\, fw_{[t_{\ell+1},t_\ell)}$. We know that item $i$'s loss
contribution $[q_i \le t^+] f_i w_i$ depends only on $h_i$ and that it is at most $t^+$. The probability of violating (32) is
therefore bounded by

$$p(\delta_\ell\, fw_{[t_{\ell+1},t_\ell)}/t^+, 1) \le p(\delta_\ell\, fw_{<t_\ell}/(2t^+), 1).$$

Let $\mu_\ell = \delta_\ell\, fw_{<t_\ell}/(2t^+)$. Then $P_\ell = p(\mu_\ell, 1)$ is our bound on the error probability. From (27), we know
that $\delta_\ell\, fw_{<t_\ell} < \delta_{\ell-1}\, fw_{<t_{\ell-1}}/\sqrt{2}$, so $\mu_\ell < \mu_{\ell-1}/\sqrt{2}$. It follows from Definition 4 (c) that $p(\mu_\ell, 1) >
\sqrt{2}\, p(\mu_{\ell-1}, 1)$, so the $P_\ell$ are geometrically increasing, and their sum is bounded by $3P_{L-1}$.

Finally, by definition (25), $p(fw_{<t_{L-1}}/(2t^+), \delta_{L-1}) = P$. To compare $P$ with $P_{L-1}$, by Definition
4 (c), we compare $fw_{<t_{L-1}}/(2t^+)\,\delta_{L-1}^2$ with $\mu_{L-1} 1^2 = \delta_{L-1}\, fw_{<t_{L-1}}/(2t^+)$, and conclude that $P_{L-1} \le
\delta_{L-1} P \le P/3$. The probability that any intermediate layer $\ell < L$ fails (32) is thus at most $P$. Since (31)
was always satisfied, we conclude that (29) is satisfied for all layers $\ell < L$ with probability at least $1 - P$.

The last layer We now consider items $i$ with weights below $t_L$. On the upper bound side, the good event
(30) states that

$$\forall t \in (t_{L+1}, t^+] : f\hat w^t_{<t_L} \le (1 + 3\delta_L)\, fw_{<t_L}. \tag{33}$$

We will show that (33) fails with probability less than $P$. For an upper bound on the estimate with any
threshold $t \in (t_{L+1}, t^+]$, we include item $i$ if $q_i > t_{L+1}$, and if so, we give it an estimate of $f_i t^+$, which
is bigger than the sampled estimate with threshold $t \le t^+$. The result is at most a factor $t^+/t_{L+1}$ bigger
than in the sampled estimate with threshold $t_{L+1}$, and by (28), $t_{L+1} \ge (1 - \delta_L)t^+$. Thus, regardless of the
random choices made, we conclude that

$$\forall t \in (t_{L+1}, t^+] : f\hat w^t_{<t_L} \le f\hat w^{t_{L+1}}_{<t_L}/(1 - \delta_L).$$

Consider the following error event:

$$f\hat w^{t_{L+1}}_{<t_L} > (1 + \delta_L)\, fw_{<t_L}. \tag{34}$$

The maximal item contribution to $f\hat w^{t_{L+1}}_{<t_L}$ is bounded by $t^+$, so the probability of (34) is bounded by

$$p(fw_{<t_L}/t^+, \delta_L) \le p(fw_{<t_L}/(2t^+), \delta_L)/2 = P/2.$$

If (34) does not happen, then since $\delta_L \le 1/3$,

$$\forall t \in (t_{L+1}, t^+] : f\hat w^t_{<t_L} \le (1 + \delta_L)\, fw_{<t_L}/(1 - \delta_L) \le (1 + 3\delta_L)\, fw_{<t_L},$$

which is the statement of (33). We conclude that (33) fails with probability at most $P/2$.
We now consider the lower bound side of (30) which states that

$$\forall t \in (t_{L+1}, t^+] : f\hat w^t_{<t_L} \ge (1 - 3\delta_L)\, fw_{<t_L}. \tag{35}$$

The analysis is very symmetric to the upper bound case. For a lower bound for weights $w_i < t_L$ with any
threshold $t \in (t_{L+1}, t^+]$, we only include $i$ if $q_i > t^+$, and if so, we only give $i$ the estimate $f_i t_{L+1}$ which is
smaller than the sampled estimate with threshold $t > t_{L+1}$. Our samples are exactly the same as those we
would pick with threshold $t^+$, and our estimates are smaller by a factor $t_{L+1}/t^+ \ge (1 - \delta_L)$, so we conclude
that regardless of the random choices,

$$\forall t \in (t_{L+1}, t^+] : f\hat w^t_{<t_L} \ge f\hat w^{t^+}_{<t_L}(1 - \delta_L).$$

We now consider the following error event:

$$f\hat w^{t^+}_{<t_L} < (1 - \delta_L)\, fw_{<t_L}. \tag{36}$$

The maximal item contribution to $f\hat w^{t^+}_{<t_L}$ is $t^+$, so as for (34), we get that the probability of (36) is bounded
by $p(fw_{<t_L}/t^+, \delta_L) \le P/2$.

If (36) does not happen, then since $\delta_L \le 1/3$,

$$\forall t \in (t_{L+1}, t^+] : f\hat w^t_{<t_L} \ge (1 - \delta_L)\, fw_{<t_L}(1 - \delta_L) > (1 - 2\delta_L)\, fw_{<t_L},$$

which implies (35). We conclude that (35) fails with probability at most $P/2$. Including the probability of
an upper bound error, we get that (30) fails with probability at most $P$.

Summing up Above we proved that the probability that (29) failed for any layer $\ell < L$ was at most
$P$. We also saw that (30) failed with probability at most $P$. If none of them fail, we proved that (22) and hence
(20) was satisfied, so (20) fails with probability at most $2P$. However, for (19) we need (20) both for
$(t^-, t^+] = [t_{\min}, t^*]$ and for $(t^-, t^+] = (t^*, t_{\max}]$, so (19) fails with probability at most $4P$. Finally, we
need to consider both the case that $\tau > t_{\max}$ and $\tau < t_{\min}$. Either of these events happens with probability
at most $P$, so we conclude that the overall error probability is at most $6P = O(P)$. If no error happened
and $\delta = \delta(fw_{<t}/(3t), P) \le 1/3$, then $|f\hat w^t - fw| \le 6\delta\, fw_{<t}$. This completes the proof of Theorem 6.

3.6 Large errors

The limitation of Theorem[6]is that it can only be used to bound the probability that the estimate error \fw T — 
is bigger than 2fw <T . Note here that errors above fw <T can only be overestimates. Since we are targeting 
arbitrarily large relative errors S, for the probability function p, we can only assume conditions (a) and (b) 
in Definition 01 

We will use some of the same ideas as we used for the last layer in Section 3.5, but tuned for our
situation. We will study intervals based on $t_\ell = t_{\max}/2^{\ell}$ for $\ell = 1, 2, \ldots$. Interval $\ell$ is for
thresholds $t \in (t_\ell, t_{\ell-1}] = (t_\ell, 2t_\ell]$. To define the error for interval $\ell$, set

$\delta_\ell = \delta(fw_{<t_\ell}/t_\ell,\ P/\ell^2)$.

The good non-error event for interval $\ell$ is that

$\widehat{fw}^{t_\ell}_{<t_\ell} \le (1 + \delta_\ell)\, fw_{<t_\ell}$. (37)

By definition, the probability that (37) is violated is at most $P/\ell^2$, so the probability of failure for any
$\ell$ is bounded by $\sum_{\ell=1}^{\infty} P/\ell^2 = P\pi^2/6 < 1.65P$. By definition, the probability that

$\tau \le t_{\max}$ (38)

is violated is at most $P$, so our total error probability is bounded by $2.65P$. Below we assume no errors, that
is, (37) holds for all $\ell$ and so does (38).
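
As a quick arithmetic check of the constant in this union bound, the following minimal Python sketch compares a partial sum of $1/\ell^2$ with $\pi^2/6$ and with 1.65. The function name `union_bound_factor` is hypothetical and only used for illustration.

```python
import math

def union_bound_factor(num_intervals):
    """Partial sum of 1/l^2 for l = 1..num_intervals (hypothetical helper)."""
    return sum(1.0 / (l * l) for l in range(1, num_intervals + 1))

limit = math.pi ** 2 / 6            # value of the full infinite sum
partial = union_bound_factor(10_000)

# Every partial sum stays below pi^2/6, which in turn is below 1.65.
assert partial < limit < 1.65
```

Multiplying by $P$ gives the bound on the total failure probability over all intervals.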

Consider an arbitrary threshold $t \in (0, t_{\max}]$ and let $\ell$ be such that $t \in (t_\ell, 2t_\ell]$. We can only
have errors for weights $w_i < t$, so we want an upper bound on $\widehat{fw}^{t}_{<t}$. The basic idea for an upper
bound is to say that we sample all items with priority above $t_\ell$, just as in the estimate
$\widehat{fw}^{t_\ell}_{<t}$, but instead of giving sampled item $i$ estimate $\max\{w_i, t_\ell\}$, it gets value
$\max\{w_i, t\}$, which is at most $t/t_\ell \le 2$ times bigger. Thus, regardless of the random choices,

$\widehat{fw}^{t}_{<t} \le 2\,\widehat{fw}^{t_\ell}_{<t}$.
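
The factor-2 inequality above is deterministic, so it can be checked numerically for arbitrary hash values. The sketch below is only an illustration under stated assumptions: it reads $\widehat{fw}^{s}_{<c}$ as the threshold-sampling estimate with sampling threshold $s$ restricted to items of weight below $c$, where item $i$ is sampled if $w_i/h_i > s$ and then gets value $f_i \max\{w_i, s\}$. The names `estimate_below` and `check_factor_two` are hypothetical.

```python
import random

def estimate_below(weights, fracs, hashes, threshold, cap):
    """Estimate of sum(f_i * w_i) over items with w_i < cap.

    Assumed semantics: item i is sampled when its priority w_i / h_i
    exceeds `threshold`; a sampled item gets value f_i * max(w_i, threshold).
    """
    total = 0.0
    for w, f, h in zip(weights, fracs, hashes):
        if w < cap and w / h > threshold:
            total += f * max(w, threshold)
    return total

def check_factor_two(trials=200, n=50, seed=1):
    """Check est(t, t) <= 2 * est(t_l, t) for random inputs with t_l <= t <= 2 t_l."""
    rng = random.Random(seed)
    for _ in range(trials):
        weights = [rng.uniform(0.01, 1.0) for _ in range(n)]
        fracs = [rng.random() for _ in range(n)]
        hashes = [1.0 - rng.random() for _ in range(n)]  # values in (0, 1]
        t_l = rng.uniform(0.05, 0.5)
        t = rng.uniform(t_l, 2 * t_l)
        lhs = estimate_below(weights, fracs, hashes, t, t)
        rhs = 2 * estimate_below(weights, fracs, hashes, t_l, t)
        if lhs > rhs + 1e-9:  # small slack for floating-point rounding
            return False
    return True
```

Every item sampled at threshold $t$ is also sampled at the smaller threshold $t_\ell$, and its value shrinks by at most the factor $t/t_\ell \le 2$, which is why the check passes for any choice of hash values.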

Assuming no error as in (37), we get

$\widehat{fw}^{t_\ell}_{<t} = fw_{[t_\ell, t)} + \widehat{fw}^{t_\ell}_{<t_\ell} \le fw_{<t} + \delta_\ell\, fw_{<t_\ell}$.

Let 

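$\delta = \delta(fw_{<t}/t,\ P/\ell^2)$.

By Lemma 5, since $t > t_\ell$, we have $\delta_\ell\, fw_{<t_\ell} \le \delta\, fw_{<t}$, so

$\widehat{fw}^{t_\ell}_{<t} \le fw_{<t} + \delta_\ell\, fw_{<t_\ell} \le fw_{<t} + \delta\, fw_{<t}$.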

We thus conclude 

$\forall t \in (0, t_{\max}] : \widehat{fw}^{t}_{<t} \le 2\,\widehat{fw}^{t_\ell}_{<t} \le 2(1 + \delta)\, fw_{<t}$. (39)

By (38) this bound also holds for $t = \tau$. Relating to the parameters in Theorem 7, note that for given
$t \le t_{\max}$, we picked $\ell = 1 + \lfloor \log_2 (t_{\max}/t) \rfloor \le 1 + \log_2 (t_{\max}/t)$, so our relative
error $\delta$ is no bigger than that in Theorem 7. This completes the proof of Theorem 7.
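
The choice of interval for a given threshold can be sketched as follows. This is a minimal illustration, assuming the dyadic intervals $(t_{\max}/2^{\ell}, t_{\max}/2^{\ell-1}]$ defined above. The function name `interval_index` is hypothetical, and floating-point rounding may misplace thresholds lying extremely close to an interval boundary.

```python
import math

def interval_index(t, t_max):
    """Index l with t in (t_max / 2**l, t_max / 2**(l - 1)], for 0 < t <= t_max."""
    if not 0 < t <= t_max:
        raise ValueError("need 0 < t <= t_max")
    return 1 + math.floor(math.log2(t_max / t))

# Example: with t_max = 1, the threshold 0.3 lies in (1/4, 1/2], so l = 2.
l = interval_index(0.3, 1.0)
assert 1.0 / 2 ** l < 0.3 <= 1.0 / 2 ** (l - 1)
```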

3.7 Upper bounding the upper bound

In Theorem 7 we defined $t_{\max}$ to be the smallest upper bound on a random priority threshold that is exceeded
with probability at most $P$, that is, the probability of generating at least $k + 1$ priorities above $t_{\max}$ is
at most $P$. As described in Section 3.2, in applications of Theorem 7 it suffices if, based on our sample, we
can compute a probabilistic upper bound $\overline{t}_{\max}$ on the upper bound $t_{\max}$ such that

$\overline{t}_{\max} \ge t_{\max}$ (40)

with probability at least $1 - P$. For better confidence lower bounds, we want $\overline{t}_{\max}$ to be small. If the
weight vector $(w_i)_{i \in I}$ was known, then we could compute $\overline{t}_{\max} \ge t_{\max}$ as described in Section 3.3.

When we do our priority sample and get priority threshold $\tau$, we get all weights above $\tau$ exactly. The
idea is now to compute a probabilistic upper bound $\overline{w}_{<\tau}$ on $w_{<\tau}$ such that

$\overline{w}_{<\tau} \ge w_{<\tau}$ (41)

with probability at least $1 - P$. Given $\overline{w}_{<\tau}$, for any $t > \tau$, we define

$\overline{w}_{<t} = w_{[\tau, t)} + \overline{w}_{<\tau}$.

If (41) is true, then $\overline{w}_{<t} \ge w_{<t}$ for all $t \ge \tau$. Using these values in the upper bound from
Section 3.3, we get a $\overline{t}_{\max}$ satisfying (40).
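
The extension of the bound from $\tau$ to larger thresholds is a simple sum, sketched below. This is an illustration only: the function name `upper_bound_below` and its arguments are hypothetical, and the bound `w_bar_tau` on the weight below $\tau$ is assumed to be given.

```python
def upper_bound_below(sampled_weights, tau, w_bar_tau, t):
    """Return w_[tau, t) + w_bar_tau, an upper bound on the weight below t.

    sampled_weights: weights of the sampled items; all weights >= tau
                     are assumed to appear here exactly.
    w_bar_tau:       assumed upper bound on the total weight below tau.
    """
    if t < tau:
        raise ValueError("need t >= tau")
    exact_part = sum(w for w in sampled_weights if tau <= w < t)
    return exact_part + w_bar_tau

# Example: weights 5 and 3 are at least tau = 1; only 3 lies in [1, 4).
assert upper_bound_below([5.0, 3.0, 0.5], 1.0, 2.0, 4.0) == 5.0
```

If `w_bar_tau` really is an upper bound on the weight below $\tau$, the returned value is an upper bound on the weight below $t$, since the part in $[\tau, t)$ is known exactly.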

It remains to derive a probabilistic upper bound $\overline{w}_{<\tau}$ on the original weight below the priority
threshold $\tau$. Based on the (partially unknown) original weight vector $(w_i)_{i \in I}$, for any threshold value
$t$, we define $\delta_t$ to be the largest value such that $p(k_{<t}/(1 - \delta_t), \delta_t) \le P$. Moreover, define
$\overline{\mu}_{<t} = k_{<t}/(1 - \delta_t)$ and $\overline{w}_{<t} = t\,\overline{\mu}_{<t}$. Note that all these values are
functions of $t$ and $k_{<t}$ that can be derived from the priority sample if $t \ge \tau$. The following lemma states
that $\overline{w}_{<\tau}$ thus defined does give us the desired probabilistic upper bound on $w_{<\tau}$.

Lemma 12 For the random priority threshold $\tau$, the probability that $w_{<\tau} > \overline{w}_{<\tau}$ is at most $P$.



Proof For any given set of input weights, consider a threshold $t$ such that $\overline{w}_{<t} < w_{<t}$. We claim that
the random priority threshold $\tau$ is smaller than $t$ with probability at most $P$. Note that $\tau < t$ if and only
if $|I^{t}_{<t}| \le k_{<t}$. Since $\overline{w}_{<t} < w_{<t}$, we have $E[|I^{t}_{<t}|] = w_{<t}/t > \overline{\mu}_{<t}$.
Moreover, $k_{<t} = (1 - \delta_t)\,\overline{\mu}_{<t}$, so

$\Pr[\tau < t] = \Pr[|I^{t}_{<t}| \le k_{<t}] = \Pr[|I^{t}_{<t}| \le (1 - \delta_t)\,\overline{\mu}_{<t}]$
$\le p(\overline{\mu}_{<t}, \delta_t) = p(k_{<t}/(1 - \delta_t), \delta_t) \le P$.

Let $t^+$ be the maximal value such that $\overline{w}_{<t^+} < w_{<t^+}$. The probability that $\tau < t^+$ is at most $P$,
and if $\tau > t^+$, then $\overline{w}_{<\tau} \ge w_{<\tau}$. If the maximal value $t^+$ does not exist, we define instead
$t^+$ as the limit where $\overline{w}_{<t} < w_{<t}$ for $t < t^+$ while $\overline{w}_{<t^+} \ge w_{<t^+}$. The probability
that $\tau < t^+$ is at most $P$, and if $\tau \ge t^+$, we again have $\overline{w}_{<\tau} \ge w_{<\tau}$. ■

3.8 Histogram similarity

As mentioned earlier, with weighted items, given the priority samples of sets $A$ and $B$, we can easily estimate
the weight of their intersection and union. As for bottom-$k$ samples, the point is that we can construct the
priority sample of their union. This priority sample involves the top $k + 1$ priorities from $A \cup B$. If one of
these is in $A$, then it must also be among the top $k + 1$ priorities from $A$.

However, if we want to compare histograms, then it is natural to say that the same item $i$ may
have different weights in different sets. The item $i$ has weight $w_i^A$ in $A$ and weight $w_i^B$ in $B$. Let
$w_i^{\max} = \max\{w_i^A, w_i^B\}$ and $w_i^{\min} = \min\{w_i^A, w_i^B\}$. The histogram similarity is

$w^{\min}/w^{\max} = (\sum_i w_i^{\min})/(\sum_i w_i^{\max})$.

This would seem a perfect application of our fractional subsets with $w_i = w_i^{\max}$ and $f_i = w_i^{\min}/w_i^{\max}$.
The issue is as follows. From our priority samples over the $w_i^A$ and $w_i^B$ we can easily get the priority sample
for the $w_i = w_i^{\max}$. However, for the $i$ sampled, we would typically not have a sample with $w_i^{\min}$, and then
we cannot compute $f_i$.

Our solution is to keep the instances of an item $i$ in $A$ and $B$ separate as twins $i^A$ and $i^B$ with priorities
$q_i^A = w_i^A/h_i$ and $q_i^B = w_i^B/h_i$. For coordination, we still use the same hash value $h_i$ to determine the
priorities. If $w_i^A = w_i^B$, we get $q_i^A = q_i^B$, and then we break the tie in favor of $i^A$. The priority sample for
$A \cup B$ consists of the top $k$ split items, and the priority threshold $\tau$ is the $(k + 1)$st biggest among all
priorities. Estimation is done as usual. The important point here is the interpretation of the results. If
$w_i^A > w_i^B$, then the priority of $i^A$ is higher than that of $i^B$. Thus, in our sample, when we see an item $i^C$,
$C \in \{A, B\}$, we count it for the union $w^{\max}$ if it is not preceded by its twin; otherwise we count it for the
intersection $w^{\min}$.
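
The twin construction can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes that "estimation as usual" means a sampled twin with weight $w$ gets value $\max\{w, \tau\}$, it uses a seeded pseudo-random generator in place of the hash function $h$, it skips twins of weight zero, and it sets $\tau = 0$ when there are at most $k$ twins. The name `twin_priority_sample` is hypothetical.

```python
import random

def twin_priority_sample(w_a, w_b, k, seed=0):
    """Return (est_min, est_max), estimates of sum_i w_i^min and sum_i w_i^max."""
    rng = random.Random(seed)
    items = sorted(set(w_a) | set(w_b))
    # One shared "hash" value per item, in (0, 1], used by both twins.
    h = {i: 1.0 - rng.random() for i in items}
    twins = []
    for i in items:
        for side, w in (("A", w_a.get(i, 0.0)), ("B", w_b.get(i, 0.0))):
            if w > 0:
                # Sort key: highest priority first; ties go to A since "A" < "B".
                twins.append((-w / h[i], side, i, w))
    twins.sort()
    sample = twins[:k]
    tau = -twins[k][0] if len(twins) > k else 0.0  # (k+1)st biggest priority
    est_min = est_max = 0.0
    seen = set()
    for _, side, i, w in sample:
        value = max(w, tau)
        if i in seen:
            est_min += value   # preceded by its twin: counts for the intersection
        else:
            seen.add(i)
            est_max += value   # first twin seen: counts for the union
    return est_min, est_max

# With k at least the number of twins, everything is sampled and the result is exact.
w_a = {"x": 3.0, "y": 1.0}
w_b = {"x": 2.0, "z": 4.0}
assert twin_priority_sample(w_a, w_b, k=10) == (2.0, 8.0)
```

The histogram similarity would then be estimated by the ratio `est_min / est_max`. Since a twin counted for the intersection always has its higher-priority twin in the sample as well, `est_min` never exceeds `est_max`.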

The resulting estimators $\widehat{w}^{\min}$ and $\widehat{w}^{\max}$ will no longer be unbiased with truly random hashing.
However, for our probabilistic error bounds with pseudo-random hashing, there is no asymptotic effect on our
analysis. The current analysis uses union bounds over threshold sampling events, relying on the fact that
each hash value $h_i$ contributes at most 1 to the number of items with priorities above a given threshold $t$.
Now $h_i$ affects 2 twins, but this is fine since all we need is that the contribution of each random variable is
bounded by a constant.